Infrastructure decisions I endorse or regret after 4 years at a startup (2024)

320 points by Meetvelde 3 days ago

It's weird that one of the reasons that you endorse AWS is that you had regular meetings with your account manager but then you regret premium support which is the whole reason you had regular meetings with your account manager.

unsnap_biceps - 12 hours ago

If you spend enough (or they think you'll spend enough), you'll get an account manager without the premium support contract, especially early in the onboarding
- necubi - 11 hours ago
  
  Or if you’re a newish startup who they hope will eventually spend enough to justify it.
rco8786 - 2 hours ago

The regret was about the cost of the premium support
dangus - 11 hours ago

As a counterpoint, I find our AWS super team to be a mix of 40% helpful, 40% “things we say are going over their head,” 20% attempting to upsell and expand our dependence. It’s nice that we have humans but I don’t think it’s a reason to choose it or not.
GCP’s architecture seems clearly better to me especially if you are looking to be global.
Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
GCP’s use of folders makes way more sense.
GCP having global VPCs is also potentially a huge benefit if you want your users to hit servers that are physically close to them. On AWS you have to architect your own solution with global accelerator which becomes even more insane if you need to cross accounts, which you’ll probably have to do eventually because of the aforementioned insanity of AWS account/organization best practices.
- 0xbadcafebee - 7 hours ago
  
  There's a very large gap between "seems" and reality. GCP is a huge PITA. It's not even stable to use, as the console is constantly unresponsive and buggy, the UX is insane, finding documentation is like being trapped in hell.
  Know how you find all the permissions a single user in GCP has? You have to make 9+ API calls, then filter/merge all the results. They finally added a web tool to try and "discover" the permissions for a user... you sit there and watch it spin while it madly calls backend APIs to try to figure it out. Permissions for a single user can be assigned to users, groups, orgs, projects, folders, resources, (and more I forget), and there's inheritance to make it more complex. It can take all day to track down every single place the permissions could be set for a single user in a single hierarchical organization, or where something is blocking some permission. The complexity increases as you have more GCP projects, folders, orgs. But, of course, if you don't do all this, GCP will fight you every step of the way.
  Compare that to AWS, where you just click a user, and you see what's assigned to it. They engineered it specifically so it wouldn't be a pain in the ass.
  > Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
  This was an issue in the early days, but it's well solved now with newer integrations/services. Follow their Well Architected Framework (https://docs.aws.amazon.com/wellarchitected/latest/framework...), ask customer support for advice, implement it. I'm not exaggerating when I say this is the best description of the best information systems engineering practice in the world, and it's achievable by startups. It just takes a long time to read. If you want to become an excellent systems engineer/engineering manager/CTO/etc, this is your bible. (Note: you have to read the entire thing, especially the appendixes; you can't skim it like StackOverflow)
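To make the permission-tracing complaint above more concrete, here is a minimal sketch of why inherited grants are tedious to audit. It assumes a simplified model in which grants attach to nodes of a hierarchy (org, folder, project) and flow downward; the `Node` class, the `effective_permissions` helper, and the permission strings are all hypothetical and do not correspond to any real GCP or AWS API. It also ignores deny rules and conditions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of a hypothetical resource hierarchy (org, folder, project)."""
    name: str
    parent: Optional["Node"] = None
    # principal (user or group name) -> permissions granted at this level
    grants: dict = field(default_factory=dict)


def effective_permissions(user: str, groups: set, node: Node) -> dict:
    """Walk from `node` up to the root, collecting grants that apply to the user.

    Returns permission -> set of node names where it was granted, so the
    result shows *where* each permission comes from, not just that it exists.
    """
    principals = {user} | set(groups)
    found: dict = {}
    current: Optional[Node] = node
    while current is not None:
        for principal in principals:
            for perm in current.grants.get(principal, set()):
                found.setdefault(perm, set()).add(current.name)
        current = current.parent
    return found


# Illustrative hierarchy: a grant via a group at the org, plus direct grants.
org = Node("org", grants={"eng-group": {"storage.read"}})
folder = Node("folder", parent=org, grants={"alice": {"compute.admin"}})
project = Node("project", parent=folder, grants={"alice": {"storage.read"}})

perms = effective_permissions("alice", {"eng-group"}, project)
```

Even in this toy model the same permission arrives by two routes (a group at the org level and a direct grant on the project), which is the kind of overlap that makes a single "what can this user do" view hard to assemble.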
  - secondcoming - 2 hours ago
    
    GCP's UI sure is dog slow. I once filled in that 'How would you rate...' prompt that sometimes appears about Instance Group management via the UI and it seems they later addressed the issue.
- danpalmer - 10 hours ago
  
  Similar to my experience with the two. We didn't have regular meetings with our GCP account manager, but they did help us and we had a technical support rep there we were in contact with sometimes. We rarely heard from anyone at AWS, and a friend had some horror stories of reporting security issues to AWS.
  Architecturally I'd go with GCP in a heartbeat. BigQuery was also one of the biggest wins in my previous role. Completely changed our business for almost everyone, vs Redshift which cost us a lot of money to learn that it sucked.
  You could say I'm biased as I work at Google (but not on any of this), but for me it was definitely the other way around: I joined Google in part because of the experience of using GCP and migrating AWS workloads to it.
- SkiFire13 - 7 hours ago
  
  > Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
  What are these struggles? The product I work on uses AWS and we have ~5 accounts (I hear they used to be more TBF) but nowadays all the infrastructure is on one of them and the other are for some niche stuff (tech support?). I could see how going overboard with many accounts could be an issue, but I don't really see issues having everything on one account.
  - sylens - 2 hours ago
    
    I like AWS, but Organizations was something that was retrofit onto the account model versus being part of the original design. GCP had second mover advantage in this area.
    The way to automate provisioning of new AWS accounts requires you to engage with Control Tower in some way, like the author did with Account Factory for Terraform.
    
    blitzar - 2 hours ago
    
    AWS makes the account model feel retrofit versus being part of the original design and 5 years later someone retrofit the organisations onto that before they added 90% of the products into any square round hole they could find.
  - sleepychu - 6 hours ago
    
    We were saved by the bell when they announced the increased account limit for S3 buckets (1M buckets, now, 1k I think before).
    Just before they announced that I was working on creating org accounts specifically to contain S3 buckets and then permitting the primary app to use those accounts just for their bucket allocation.
    AWS themselves recommend an account per developer, IIRC.
    It's as you say, some policy or limitation might require lots of accounts and lots of accounts can be pretty challenging to manage.
- UltraSane - 9 hours ago
  
  Global VPCs are very nice but they feel like a single blast radius.
  - dangus - 8 hours ago
    
    Whether or not your VPC can have subnets in multiple regions is entirely unrelated to security.
    
    UltraSane - 8 hours ago
    
    I meant failure blast radius. Having isolated regions is a core part of the AWS reliability design. AWS has had entire regions fail but these failures have always been isolated to a single region. Global VPCs must rely on globally connected routers that can all fail in ways AWS regional VPCs can't.
    
    ses1984 - 7 hours ago
    
    If you need global HA to the extent that you're worried about global VPC failure modes, you're going to have to spend a lot of effort to squeeze uptime to the max regardless of where you deploy.
    Undersea cable failures are probably more likely than a google core networking failure.
    In AWS a lot of "global" things are actually just hosted in us-east-1.
    
    easton - an hour ago
    
    On the other hand, when they say something is in us-west-2 they mean it, so if another region has an outage your workloads aren't impacted unless your code is reaching out to that region.
    Guessing that's similar on the other clouds.
- wetpaws - 10 hours ago
  
  [dead]

SoftTalker - 8 hours ago

After listing dozens of infrastructure products/projects: "My general infrastructure advice is “less is better”."

That made me laugh. Yes I get that they probably didn't use all of these at the same time.

rf15 - 5 hours ago

> Startups don’t have the luxury of a DBA

but... you are spending so much on AWS and premium support... surely you can afford that

Lucasoato - 2 hours ago

Self managing a database vs getting RDS isn't an easy choice. It depends on the scale, it depends on the industry... if you're already locked in to AWS, the price difference between the bare machines and RDS usually isn't enough to pay for another person.
If you're starting everything from scratch, you might think that going to other providers (like Hetzner) is a good idea, and it may definitely be! But then you need to set up a Site2Site VPN because the second big customer of your B2B SaaS startup uses on-premises infrastructure and AWS has that out of the box, while you need an expert networking guy to do that the right way on Hetzner.
happymellon - 2 hours ago

The last startup I was with that used AWS didn't spend anything on premium support. We were given startup credits to apply to our accounts, and they were always happy to hand out more to get us hooked.
enviwje97 - 4 hours ago

[flagged]

calmbonsai - 13 hours ago

This is the best post to HN in quite some time. Kudos to the detailed and structured break-down.

If the author had a Ko-Fi they would've just earned $50 USD from me.

I've been thinking of making the leap away from JIRA and I concur on RDS, Terraform for IAC, and FaaS whenever possible. Google support is non-existent and I only recommend GC for pure compute. I hear good things about Big Table, but I've never used it in production.

I disagree on Slack usage aside from the postmortem automation. Slack is just gonna' be messy no matter what policies are put in place.

xyzzy_plugh - 11 hours ago

Anecdotally I've actually had pretty good interactions with GCP including fast turn arounds on bugs that couldn't possibly affect many other customers.
notyourwork - 6 hours ago

Curious from you or others when FaaS isn’t possible? What criteria do you look for to decide or migrate off?
- coverj - 4 hours ago
  
  not possible:
    - workloads over 15m for lambda last time I checked, unsure on other providers
    - if you are looking to do anything stateful
  possible but not ideal/inconveniences:
    - cold starts can hamper latency-sensitive apps (language dependent + there are things you can do)
    - if you have consistent traffic it's not very good value for money
    - if you value local debugging
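The criteria in the comment above can be turned into a rough checklist. The following is only a sketch: the `Workload` fields and the `faas_fit` helper are hypothetical names made up for illustration, and the 15-minute figure is the Lambda limit as quoted by the commenter, not a value checked against other providers.

```python
from __future__ import annotations

from dataclasses import dataclass

# 15 minutes: the Lambda limit quoted in the comment ("last time I checked").
LAMBDA_MAX_SECONDS = 15 * 60


@dataclass
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    max_runtime_seconds: int
    stateful: bool = False
    latency_sensitive: bool = False
    steady_traffic: bool = False
    needs_local_debugging: bool = False


def faas_fit(w: Workload) -> tuple[list[str], list[str]]:
    """Return (blockers, caveats) for running this workload as a function."""
    blockers: list[str] = []
    caveats: list[str] = []
    # "Not possible" criteria from the comment.
    if w.max_runtime_seconds > LAMBDA_MAX_SECONDS:
        blockers.append("runtime exceeds the 15-minute limit")
    if w.stateful:
        blockers.append("workload is stateful")
    # "Possible but not ideal" criteria from the comment.
    if w.latency_sensitive:
        caveats.append("cold starts can hurt latency")
    if w.steady_traffic:
        caveats.append("consistent traffic makes FaaS poor value for money")
    if w.needs_local_debugging:
        caveats.append("local debugging is awkward")
    return blockers, caveats


# An hour-long batch job with steady traffic: one blocker, one caveat.
blockers, caveats = faas_fit(Workload(max_runtime_seconds=3600, steady_traffic=True))
```

Any blocker means the workload should probably live elsewhere; caveats are trade-offs to weigh rather than hard stops.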
unethical_ban - 13 hours ago

What do you use if not Slack? OP's advice is standard best practice. Respect people's time by not expecting immediate response, and use team or function based channels as much as possible.
Other options are email of course, and what, teams for instant messages?
- jasonpeacock - 12 hours ago
  
  I’ve always thought that forums are much better suited to corporate communications than email or chat.
  Organized by topics, must be threaded, and default to asynchronous communications. You can still opt in to notifications, and history is well organized and preserved.
- ale42 - 5 hours ago
  
  We use self-hosted Mattermost (team version, i.e. without limits but no enterprise features like LDAP). Fine for a small team (around 40 active users here) where you can script account actions via the API, probably not fine when users become a lot more, or you might need access to the compliance functions for audit purposes, etc.
  For us the free version of Slack was insufficient, the commercial one too expensive, and anyway, given that it's a cloud-based system, it's not compliant with our internal rules for confidential information (unless we can get some specific agreement with them). On the side, there is a bit too much analytics/telemetry in the Slack client.
- jasonpeacock - 12 hours ago
  
  The bullet points for using Slack basically describe email (and distribution lists).
  It’s funny how we get an instant messaging platform and derive best practices that try to emulate a previous technology.
  Btw, email is pretty instant.
  - nottorp - 7 hours ago
    
    If you work in a team, email is limited to the people you cc: while a convo in a slack channel can have people you didn't think of jump in* with information.
    See the other point in the article about discouraging one on one private messages and encouraging public discussion. That is the main reason.
    * half a day later or days later if you do true async, but that's fine.
    
    dijit - 7 hours ago
    
    I am neutral in this particular topic, so don’t think I’m defending or attacking or anything.
    But aren’t mailing lists and distribution groups pretty ubiquitous?
    
    nottorp - 6 hours ago
    
    But - from the people you actually want to get to contribute - emails come with an expectation of a well thought out text. IMs ... less so.
    I've been working across time zones via IM and email since ... ICQ.
    I'm probably biased by that but I consider email the place for questions lists and long statuses with request for comments, and for info that I want retained somewhere. While IM is a transient medium where you throw a quickie question or statement or whine every couple hours - and check what everyone else is whining about.
    
    dijit - 6 hours ago
    
    I have now been roped into talking more about a topic I have no interest in and am completely ambivalent to… :/
    But clearly, that’s cultural.
    If you keep your eyes on the Linux kernel mailing list you’ll see a lot of (on topic) short and informal messages flying in all directions.
    If you keep your eyes on the emails from big tech CEOs that sometimes appear in court documents, you’ll see that the way they use email is the same way that I’d use Slack or an instant messenger.
    That’s likely because it’s the tool they have available. We have IM tools that connect us to the people we need (inside the company), making email the only place for long form content, which means it’s only perceived as being for long form content.
    But when people have to use something federated more often, it does seem like email is actually used this way.
  - unethical_ban - 11 hours ago
    
    I get it, email accomplishes a lot. But it "feels" like a place these days for one-off group chats, especially for people from different organizations. Realtime chat has its places and can also step in to that email role within a team. All my opinion, none too strongly held.

kstrauser - 13 hours ago

> Picking Terraform over Cloudformation: Endorse

I, too, prefer McDonald's cheeseburgers to ground glass mixed with rusty nails. It's not so much that I love Terraform (spelled OpenTofu) as that it's far and away the least bad tool I've used in the space.

orwin - 11 hours ago

Terraform/OpenTofu is more than OK. The fact that you can use it to configure your Cisco products as well as AWS is honestly great for us. It's also a bit like Ansible: if you don't manage it carefully and try to separate as much as possible early, it starts bloating, so you have to curate early.
Terragrunt is the only sane way to deploy terraform/openTofu in a professional environment though.
- kstrauser - 8 hours ago
  
  I curse at Terraform at least once a week, usually right after I’ve discovered some weird arbitrary limitation or surprising misfeature. It’s still what I reach for when I need to manage a whole organization. And compared to CloudFormation, it’s the freaking Sistine Chapel of IaC.
- xorcist - 6 hours ago
  
  I never understood this. Why not use Ansible instead, especially if you already use it? Doubly so when you have Cisco config to manage. The experience is generally so much better it's not comparable, and it is much easier to infer running state.
  - gchamonlive - 2 hours ago
    
    Ansible and terraform have some overlap, but they do tend to serve different purposes. The consequences of terraform having a state file should steer your decision.
    However, I often find Ansible modules to be confusing to use. Maybe with LLMs it's now easier to draft Ansible roles and maintain them, but I always had aggro whenever I needed to go to the docs for something I've done many times, just because the modules are that inconsistent.
    
    xorcist - 36 minutes ago
    
    Setting aside the turing completeness of them, in practice Ansible is a complete superset of Terraform. From experience, the only times you appreciate the state file is when you have uncontrolled changes, in which case you are in for a bad time anyway.
    Ansible modules are trivial to write and more people should. Most are trivial in practice and just consist of a few underlying API calls. A dozen line snippet you fully understand is generally not a maintenance burden. A couple of thousand someone else wrote might be.
- MrDarcy - 9 hours ago
  
  We can also use expect to configure Cisco routers and AWS infrastructure, doesn’t mean we should.
walt_grata - 11 hours ago

I've been very happy using cdk for interacting with aws. Much better than terraform and the like.
- etothet - 9 hours ago
  
  I second this. I do use some terraform, but for most of our stacks, CDK has been fantastic.
nine_k - 11 hours ago

Any opinion on Pulumi?
- gouggoug - 11 hours ago
  
  Not an opinion on Pulumi specifically, but an opinion on using imperative programming languages for infrastructure configuration: don't do it. (This includes using things like CDKTF)
  Infrastructure needs to be consistent, intuitive and reproducible. Imperative languages are too unconstrained. Particularly, they allow you to write code whose output is unpredictable (for example, it'd be easy to write code that creates a resource based on the current time of day...).
  With infrastructure, you want predictability and reproducibility. You want to focus more on writing _what_ your infra should look like, less _how_ to get there.
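The time-of-day example above can be made concrete. This is a sketch in plain Python, not Pulumi, CDKTF, or any real IaC library; the function names and the `spec` dictionary layout are hypothetical and exist only to contrast the two styles.

```python
from datetime import datetime


def imperative_plan(now: datetime) -> list:
    """Plan whose output depends on *when* it runs: the problem case.

    The same code produces different infrastructure at 10:00 and at 22:00.
    """
    count = 4 if 9 <= now.hour < 17 else 1
    return [f"worker-{i}" for i in range(count)]


def declarative_plan(spec: dict) -> list:
    """Plan derived only from a static spec: same input, same output."""
    return [f"{spec['name']}-{i}" for i in range(spec["count"])]


# Two runs of the "same" imperative code disagree with each other.
morning = imperative_plan(datetime(2024, 1, 1, 10, 0))
night = imperative_plan(datetime(2024, 1, 1, 22, 0))

# The spec-driven plan is reproducible no matter when it is evaluated.
spec = {"name": "worker", "count": 2}
plan = declarative_plan(spec)
```

The point is not that general-purpose languages force the first style, only that they permit it; a constrained configuration language makes it harder to write by accident.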
  - nothrabannosir - 7 hours ago
    
    Couldn't disagree more.
    I have written both TF and then CDKTF extensively (!), and I am absolutely never going back to raw TF. TF vs CDKTF isn't declarative vs imperative, it's "anemic untyped slow feedback mess" vs "strong typesystem, expressive builtins and LSP". You can build things in CDKTF that are humanly intractable in raw TF and it requires far less discipline, not more, to keep it from becoming an unmaintainable mess. Having a typechecker for your providers is a "cannot unsee" experience. As is being able to use for loops and defining functions.
    That being said, would I have preferred a CDKTF in Haskell, or a typed Nix dialect? Hell yes. CDKTF was awful, it was just the least bad thing around. Just like TF itself, in a way.
    But I have little problems with HCL as a compilation target. Rich ecosystem and the abstractions seem sensible. Maybe that's Stockholm syndrome? Ironically, CDKTF has made me stop hating TF :)
    Now that Hashicorp put the kibosh on CDKTF though, the question is: where next...
  - kstrauser - 8 hours ago
    
    Thanks for saving me the trouble of writing exactly that. I want my IaC to be roughly as Turing complete as JSON. It’s sooo tempting to say “if only I could write this part with a for loop…” and down that path lies madness.
    There are things I think Terraform could do to improve its declarative specs without violating the spirit. Yet, I still prefer it as-is to any imperative alternatives.
  - vanviegen - 8 hours ago
    
    > Particularly, they allow you to write code whose output is unpredictable
    Is that an easy mistake to make and a hard one to recover from, in your experience?
    The way you have to bend over backwards in Terraform just to instantiate a thing multiple times based on some data really annoys me..
    
    gouggoug - 5 hours ago
    
    > Is that an easy mistake to make and a hard one to recover from, in your experience?
    If you're alone in a codebase? Probably not.
    In a company with many contributors of varying degrees of competence (from your new grad to your incompetent senior staff), yes.
    In large repositories, without extremely diligent reviewers, it's impossible to prevent developers from creating the most convoluted anti-patterny spaghetti code, that will get copy/pasted ad nauseam across your codebase.
    Terraform as a tool and HCL as a programming language leave a lot to be desired (in hindsight only, because, let's be honest, it's been a boon for automation), but their constrained nature makes it easier to rein in the zealous junior developer who just discovered OOP and insists on trying it everywhere...
  - popalchemist - 10 hours ago
    
    yes. IaC is a misnomer. IaC implementations should have a spec (some kind of document) as the source of truth; not code.
- AIorNot - 3 hours ago
  
  We used it at my last startup and I loved it but im a dev not devops guy
  I loved reading code
- orthecreedence - 8 hours ago
  
  Pulumi is superior to Terraform for my use cases. It's actually Infrastructure as Code. Terraform pretends to be, but uses a horrible config language that tries to skirt the line between programming language and config spec, and skirts it horribly. Reorganizing modules is a huge pain. I dreaded using Terraform and I spin things up and down in Pulumi all day. No contest.
  Granted, I'm a programmer, have been for a long time, so using programming tools is a no brainer for me. If someone wants to manage infra but doesn't have programming skills, then learning the Terraform config language is a great idea. Just kidding, it's going to be just as confusing and obnoxious as learning the basic skills you need in python/js to get up and running with Pulumi.
  - kstrauser - 8 hours ago
    
    I disagree with that. I think it’s satisfying to find a way to express my intent in HCL, and I don’t think I could do it as well without a strong programming background.
- x3n0ph3n3 - 10 hours ago
  
  My opinion is there are not enough good software developers doing DevOps, and HCL is simple enough and can have pretty good guardrails on it. My biggest concern is people shooting themselves in the foot because the static analysis tools available for HCL don't work with Pulumi.
  - SlightlyLeftPad - 9 hours ago
    
    It’s an unfortunate truth that good software developers aren’t crazy enough to want to do it.
MrDarcy - 9 hours ago

CDK is far better than Terraform.
- YetAnotherNick - 3 hours ago
  
  CDK is better when it works. Terraform has so many escape hatches it scales better with edge cases over time.
  There are all sorts of requirements that pop up, especially in times of downtime or when testing infra migration in production etc., and it's much easier to manually edit the Terraform states.
- jauntywundrkind - 8 hours ago
  
  If you're any good at all at CDK, cdk8s is also a very solid, clear & clean way to do Kubernetes. https://cdk8s.io/
  I'm trying to make the decision for where to go with my home lab, and while Pulumi and Cue look neat, cdk8s seems so predictable & has such clear structure & form to it.
  That said, the L1/L2/L3 distinction can be a brute to deal with. There's significant hidden complexity there.
  - shepherdjerred - 8 hours ago
    
    I use cdk8s for my homelab and absolutely love it. 100% recommend.
    Homelab CDKs: https://github.com/shepherdjerred/monorepo/tree/main/package...
    Script I wrote to generate types from Helm charts: https://github.com/shepherdjerred/monorepo/tree/main/package...
easterncalculus - 11 hours ago

You can honestly do a lot of what people do with Terraform now just using Docker and Ansible. I'm surprised more people don't try to. Most clouds are supported, even private clouds and stuff like MAAS.
- x3n0ph3n3 - 10 hours ago
  
  Yeah, but Ansible is one of the nine circles of hell and its support for various AWS services beyond EC2 and S3 is near nonexistent.
  - Tostino - 8 hours ago
    
    I have mixed feelings about it. On my first startup, I used ansible to automate all of the manual workflows and server setup that we had done. Everything was just completely manual and in people's heads before, and translating it to ansible was a pain in the ass to say the least. I don't think it would have been any easier to translate it to something else though. It ended up working fine and we had a solid system that I could reset up our environment from scratch on a set of VPS provided by some terraform scripts. We were originally on digitalocean, and had to migrate to Azure because of acquisition BS.
    For my current startup I ended up not going a direction where I needed ansible. I've now got everything in helm charts and deployable to K8S clusters, and packaged with Dockerfiles. Not really missing ansible, but not exactly in love with K8S either. It works well enough I guess.
    
    SkiFire13 - 7 hours ago
    
    > on a set of VPS provided by some terraform scripts
    You ended up needing Terraform too for the infrastructure though. At that point why not just use Terraform?
    
    Tostino - 21 minutes ago
    
    Terraform was just for interacting with the cloud provider and spinning up the servers. Ansible was responsible for deploying all dependencies and getting the servers actually ready for use. Remember, none of this architecture was dockerized.
    I had originally used Ansible to interact with the cloud provider and do the provisioning too, but someone on the corporate infrastructure team wanted to use terraform for that instead, so they did the migration.

rco8786 - an hour ago

I think we're making a mistake by shoving all of this into the cloud rather than building tooling around local agents (worktrees, containers, as mentioned as "difficult" in the post). I think as an industry we just reach for cloud like our predecessors reached for IBM, without critical thought about what's actually the right tool for the job.

If you can manage docker containers in a cloud, you can manage them on your local. Plus you get direct access to your own containers, local filesystems and persistence, locally running processes, quick access for making environmental tweaks or manual changes in tandem with your agents, etc. Not to mention the cost savings.

b40d-48b2-979e - an hour ago

You also get all the risk of exposing your network and the cost of maintenance for your own datacenter.

yakkomajuri - 2 hours ago

> "Like most tech debt, we didn’t make this decision, we just did not not make this decision."

This is an important point.

kolja005 - 10 hours ago

>Since the database is used by everyone, it becomes cared for by no one. Startups don’t have the luxury of a DBA, and everything owned by no one is owned by infrastructure eventually.

This post was a great read.

Tangent to this, I've always found "best practices" to be a bit of a misnomer. In most cases in software and especially devops I have found it means "pay for this product that constrains the way that you do things so you don't shoot yourself in the foot". It's not really a "practice" if you're using a product that gives you one way to do something. That said my company uses a very similar tech stack and I would choose the same one if I was starting a company tomorrow, despite the fact that, as others have mentioned, it's a ton to keep in your head all at once.

- 6 hours ago

[deleted]

mettamage - 5 hours ago

As a non infra guy I'll say this. I'm curious about Linear. At my own company I vibecoded my own project management app against the JIRA API because I can't stand our version of JIRA. It's too many clicks, too many things to remember and it's unintuitive.

nicoburns - 2 hours ago

If you have the power to do so, get rid of JIRA immediately. There are like 10 competitors that are all dramatically better.
I would personally recommend https://www.shortcut.com which is very well designed, and also made some really sensible improvements over the time that we used it.
phrotoma - 3 hours ago

Baffling piece of software. It's a task manager and every time I use it I flail around for ages trying to figure out how to mark a task completed. No idea why people like it.
ubercore - 4 hours ago

Been incredibly happy with the speed, featureset, and pace of new (good) features in Linear. Our team has adopted it quite happily and it gets a ton of good use. Can fully recommend.
AIorNot - 3 hours ago

As everyone knows JIRA sucks but some perfect implementation of it exists in the ether at some company you will never work at :)
These days AI in the doc, spec and production lifecycle means we need AI-first ticket tooling. I haven’t used Linear but I suspect it works far better with AI than JIRA.

sylens - 3 hours ago

The part about account teams for AWS and GCP is very true in my experience. I could tell my AWS account team that I was hungry and they would offer to bring me a bagel in an hour. My GCP account team no-shows our cadence calls and somehow forgets the one question I ask them in the intervening time between our calls, which means each month I get to re-explain the issue as they pretend to escalate it again.

kaycey2022 - 12 hours ago

Feels like a minor glimpse into what's involved in running tech companies these days. Sure this list could be much simpler, but then so would the scope of the company's offerings. So AI would offer enough accountability to replace all of this? Agents juggling million token contexts? It's kind of hard to wrap my head around.

nine_k - 11 hours ago

Agents run tools, too. You can make an LLM count by the means of language processing, but it's much more efficient to let it run a Python script.
By the same token, it's more efficient to let an LLM operate all these tools (and more) than to force an LLM to keep all of that on its "mind", that is, context.
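As a rough illustration of the "let it run a script" point, here is a sketch of a tiny tool registry. The names `TOOLS`, `run_tool`, and `count_occurrences` are hypothetical and not taken from any agent framework; the idea is only that the model emits a tool name plus arguments and ordinary code does the counting.

```python
from typing import Callable, Dict


def count_occurrences(text: str, needle: str) -> int:
    """Count non-overlapping occurrences of `needle` in `text`."""
    return text.count(needle)


# Hypothetical registry: tool name -> plain Python callable.
TOOLS: Dict[str, Callable[..., object]] = {
    "count_occurrences": count_occurrences,
}


def run_tool(name: str, **kwargs) -> object:
    """Dispatch a tool call by name; unknown names fail loudly."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


# The agent would produce the name and arguments; the answer comes from code.
result = run_tool("count_occurrences", text="strawberry", needle="r")
```

The dispatch itself is deterministic, so any unreliability sits in whether the agent chooses to call the tool and with which arguments.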
- kaycey2022 - 10 hours ago
  
  Agents are just not deterministic. You will see weird things, like in one thread an agent simply says it cannot access a CLI tool for whatever reason. You inspect the call, and it worked just fine in another thread. You eventually shrug your shoulders and close the thread, and pick up from another instead of having the agent flail around some obvious BS for hours and hours.
  Just because they can run tools, doesn't mean they run them reliably. Running tools is not a be all and end all of the problem.
  Amdahl's law is still in play when it comes to agents orchestrating entire business processes on their own.
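For reference, Amdahl's law says that if a fraction p of a process is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies that formula; the 80% and 10x figures are made-up numbers for illustration, not measurements of any agent system.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on speedup as the factor grows without limit."""
    if accelerated_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)


# Hypothetical: agents handle 80% of a business process 10x faster.
speedup = amdahl_speedup(0.8, 10)
ceiling = amdahl_limit(0.8)
```

With these assumed numbers the whole process gets roughly 3.6x faster, and no amount of extra agent speed pushes it past about 5x, because the remaining 20% still runs at the old pace.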
  - cindyllm - 10 hours ago
    
    [dead]

wavemode - 10 hours ago

(2024)

past discussion: https://news.ycombinator.com/item?id=39313623

tomhow - 5 hours ago

Thanks! Macroexpanded...
Almost every infrastructure decision I endorse or regret - https://news.ycombinator.com/item?id=39313623 - Feb 2024 (626 comments)

nevalainen - 9 hours ago

There is a lot of "stuff" for liking to keep it simple. Great article though!

bigiain - 7 hours ago

I initially read this wrong as "Almost every infrastructure decision I make I regret after 4 years", and I nodded my head in agreement.

I've been working mostly at startups most of my career (for Sydney Australia values of "start up" which mostly means "small and new or new-ish business using technology", not the Silicon Valley VC money powered moonshot crapshoot meaning). Two of those roles (including the one I'm in now) have been longer than a decade.

And it's pretty much true that almost all infrastructure (and architecture) decisions are things that 4-5 years later become regrets. Some standouts from 30 years:

I didn't choose Macromind/Macromedia Director in '94 but that was someone else's decision I regretted 5 years later.

I shouldn't have chosen to run a web business on ISP web hosting and Perl4 in '95 (yay /cgi-bin).

I shouldn't have chosen globally colocated desktop pc linux machines and MySQL in '98/99 (although I got a lot of work trips and airline miles out of that).

I shouldn't have chosen Python2 in 2007, or even worse Angular2 in 2011.

I _probably_ shouldn't have chosen Arch Linux (and a custom/bastardised Pacman repo) for a hardware startup in 2013.

I didn't choose Groovy on Grails in 2014 but I regretted being recruited into being responsible for it by 2018 or so.

I shouldn't have chosen Java/MySQL in 2019 (or at least I should have kept a much tighter leash on the backend team and their enterprise architecture astronaut).

The other perspective on all those decisions though, each of them allowed a business to do the things they needed to take money off customers (I know I know, that's not the VC startup way...) Although I regretted each of those later, even in retrospect I think I made decent pragmatic choices at the time. And at this stage of my career I've become happy enough knowing that every decision is probably going to have regrets over a 4 or 5 year timeframe, but that most projects never last long enough for you to get there - either the business doesn't pan out and closes the project down, or a major ground up rewrite happens for reasons often unrelated to 5 year old infrastructure or architecture choices.

stroebs - 5 hours ago

The Bottlerocket issues really surprise me - not an experience I've shared even with heavy use. I use EKS with Bottlerocket + managed addons + Karpenter, and our security team is super happy that _nobody_ has access to the underlying nodes. Immutable OS is a key selling point, and Brupop "just works" to keep everything up to date without any input. Patching nodes is something I haven't had to think about in almost a year.

lightyrs - 6 hours ago

Interested to know what's changed (if anything) in the two years since this was written.

hambes - 6 hours ago

for one thing the ingress nginx is retiring[1], so they're probably revisiting alternatives, maybe even the service meshes for the new gateway api.
1: https://kubernetes.io/blog/2026/01/29/ingress-nginx-statemen...

prplfsh - 4 hours ago

I feel so many of these. LOL @ GitHub endorse-ish, more -ish every day now. Overall though seems like a pretty good hit rate.

Surprised to see datadog as a regret - it is expensive but it's been enormously useful for us. Though we don't run kubernetes, so perhaps my baseline of expensive is wrong.

neo_doom - 10 hours ago

> Regret: Not adopting an identity platform early on. I stuck with Google Workspace at the start...

I've worked with hundreds of customers to integrate IdPs with our application and Google Workspace was by far the worst of the big players (Entra ID, Okta, Ping). It's extremely inflexible for even the most basic SAML configuration. Stay far, far away.

0xbadcafebee - 7 hours ago

And it's a horrible moat. I've gotten locked out of a Google Workspace permanently because the person who set it up left, used a personal email/phone to do it, and despite us owning/controlling the domain, Google wouldn't unlock admin access to the Workspace for us, they would only delete it. Unacceptable business risk.
- Imustaskforhelp - 5 hours ago
  
  Holy moly. this is nightmare fuel.

Grimburger - 12 hours ago

> There are no great FaaS options for running GPU workloads

Knative on k8s works well for us, there's some oddities about it but in general does the job

jbmsf - 10 hours ago

Thanks. I've been meaning to write one of these for a long time, but you went into detail in a very effective, organized way.

I also reached a lot of similar decisions and challenges, even where we differ (ECS vs EKS) I completely understand your conclusions.

jmward01 - 10 hours ago

You will never agree 100% with someone else when it comes to decisions like this, but clearly there is a lot of history behind these decisions and they are a great starting point for conversations internally I think.

bob1029 - 5 hours ago

> Not using Function as a Service(FaaS) more

FaaS is almost certainly a mistake. I get the appeal from an accountant's perspective, but from a debugging and development perspective it's really fucking awful compared to using a traditional VM. Getting at logs in something like azure functions is a great example of this.

I pushed really hard for FaaS until I had to support it. It's the worst kind of trap. I still get sweaty thinking about some of the issues we had with it.

CodesInChaos - 4 hours ago

What's the issue with logging? I would have expected stdout/stderr to get automatically transferred to the providers managed logging solution (e.g. cloudwatch).
Though I never really understood the appeal of FaaS over something like Google-Cloud-Run.
- bruce343434 - 3 hours ago
  
  As a developer who spent a couple months developing a microservice using aws lambda functions:
  it SUCKS. There's no interactive debugging. Deploy for a minute or 5 depending on the changes, then trigger the lambda, wait another 5 minutes for all the logs to show up. Then proceed with printf/stack trace debugging.
  For reasons that I forgot, locally running the lambda code on my dev box was not an option. Neither was deploying the cloud environment locally.
  I wasn't around for the era but I imagine it's like working on an ancient mainframe with long compile times and a very slow printer.
  - AIorNot - 3 hours ago
    
    Lol exactly

isoprophlex - 5 hours ago

> There are no great FaaS options for running GPU workloads

I love modal. I think they got FaaS for GPU exactly right, both in terms of their SDK and the abstractions/infra they provide.

rixed - 8 hours ago

Sure, let's take advice about infrastructure from that guy who needs a tool to automate postmortems.

mlrtime - an hour ago

Can you expand? Have you never worked at a tech company that has incidents?
In which world does a large tech company exist without problems, if so how big, how many customers etc?

ttoinou - 7 hours ago

Nice but how do those services combine with each other? How do you combine Notion, Slack, your git hosting, Linear and your CI/CD? If there are only URLs between them it’s hard to link all the work together

jrjeksjd8d - 13 hours ago

I see you regret Datadog but there's no alternative - did you end up homebrewing metrics, or are you just living with their insane pricing model? In my experience they suck but not enough to leave.

stackskipton - 12 hours ago

Not author but Prometheus is perfectly acceptable alternative if you don't want to go whole Otel route.
- t-writescode - 10 hours ago
  
  Prometheus + … what? Datadog is a visualization platform, prometheus is a data gathering infrastructure.
  - stackskipton - 10 hours ago
    
    Grafana is most common one.
jpgvm - 9 hours ago

VictoriaMetrics stack. Better, cheaper, faster queries, more k8s native, etc. Easy to run with budget saved from not being on Datadog + attracts smart and observability minded engineers to your team.
velocity3230 - 6 hours ago

LGTM stack?
lelandbatey - 11 hours ago

Currently going through leaving DD at work. Many potential options, many companies trying to break in. The one that calls to me spiritually is: throw it all in Clickhouse (hosted Clickhouse is shockingly cheap) with a hosted HyperDX (logs and metrics UI) instance in front of it. HyperDX has its issues, but it's shocking how cheap it is to toss a couple hundred TB of logs/metrics into Clickhouse per month (compared to the king's ransom DD charges). And you can just query the raw rows, which really comes in handy for understanding some in-the-weeds metrics questions.
jamiemallers - 5 hours ago

"No alternative" isn't quite right anymore, though I understand the feeling. The real problem with Datadog isn't the pricing - it's that their per-host model incentivizes you to care about infrastructure topology rather than user-facing behavior. You end up with 10,000 dashboards and still can't answer "is checkout broken right now?"
The open source stack has gotten genuinely viable: Prometheus/VictoriaMetrics for metrics, Grafana for viz, and OpenTelemetry as the collection layer means you're not locked into anyone's agent. The gap used to be in correlation - connecting a metric spike to a trace to a log line - but that's narrowed significantly.
The actual hard part of leaving DD isn't technical, it's organizational. DD becomes load-bearing for on-call runbooks, alert routing, and team muscle memory. Migration is less "swap the backend" and more "retrain your incident response."
If you're evaluating: the question I'd ask isn't "which vendor has the best dashboards" but "can I get from alert to root cause in under 5 minutes with this tool?" That's the metric that actually correlates with MTTR, and it's where most monitoring setups (including expensive ones) fail.

mwcampbell - 13 hours ago

I disagree on Kubernetes versus ECS. For me, the reasons to use ECS are not having to pay for a control plane, and not having to keep up with the Kubernetes upgrade treadmill.

x3n0ph3n3 - 10 hours ago

This. k8s is primarily resume driven development in most software shops. Hardly any product or service really needs its complexity.
- karolist - 7 hours ago
  
  To replace Kubernetes, you inevitably have to reinvent Kubernetes. By the time you build in canaries, blue/green deployments, and rolling updates with precise availability controls, you've just built a bespoke version of k8s. I'll take the industry standard over a homegrown orchestration tool any day.
  - secondcoming - 2 hours ago
    
    We've used ECS back when we were on AWS, and now GCE.
    We didn't have to invent any homegrown orchestration tool. Our infra is hundreds of VMs across 4 regions.
    Can you give an example of what you needed to do?
- jauntywundrkind - 8 hours ago
  
  The number of tools and systems here that work because of k8s is significant. K8s is a control plane and an integration plane.
  I wish luck to the imo fools chasing the "you may not need it" logic. The vacuum that attitude creates in its wake demands many many many complex & gnarly home-cooked solutions.
  Can you? Sure, absolutely! But you are doing that on your own, glueing it all together every step of the way. There's no other glue layer anywhere remotely as integrative, that can universally bind to so much. The value is astronomical, imho.
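
To make the "rolling updates with precise availability controls" mentioned earlier in this thread a little more concrete, here is a deliberately simplified sketch of the batching idea. It is not how Kubernetes or ECS implement it (there is no surge capacity, health checking, or rollback here), and `rolling_update_batches` and `max_unavailable` are hypothetical names chosen for illustration.

```python
def rolling_update_batches(replicas: int, max_unavailable: int) -> list[list[int]]:
    """Split replica indexes into batches so that at most `max_unavailable`
    replicas are being replaced at any one time."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    batches = []
    for start in range(0, replicas, max_unavailable):
        end = min(start + max_unavailable, replicas)
        batches.append(list(range(start, end)))
    return batches


# Five replicas, never more than two out of service at once.
plan = rolling_update_batches(replicas=5, max_unavailable=2)
# plan is [[0, 1], [2, 3], [4]]
```

The part that tends to grow into a bespoke orchestrator is everything around this loop: waiting for each batch to pass health checks, pausing or rolling back on failure, and shifting traffic for canaries or blue/green.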

robszumski - 13 hours ago

Thanks for sharing, really helpful to see your thinking. I haven't fully embraced FaaS myself but never regretted it either.

Curious to hear more about Renovate vs Dependabot. Is it complicated to debug _why_ it's making a choice to upgrade from A to B? Working on a tool to do app-specific breaking change analysis so winning trust and being transparent about what is happening is top of mind.

When were you using quay.io? In the pre-CoreOS years, CoreOS years (2014-2018), or the Red Hat years?

zem - 13 hours ago

I would love to read more about the pros and cons of using a single database, if anyone has pointers to articles

stackskipton - 12 hours ago

SRE here who has dealt with this before.
Everything in the article is an excellent point, but the other big point is that schema changes become extremely difficult because you have unknown applications possibly relying on that schema.
Also, at a certain point the database becomes absolutely massive and you will need teams of DBAs caring for and feeding it.
- jghn - 14 minutes ago
  
  This is true. But at the same time people need to understand that most companies will never hit that certain point. It's a matter of if, not when.
  Everyone tries to plan for a world where they've become one of the hyperscalers. Better to optimize for the much more likely scenarios.
- x3n0ph3n3 - 10 hours ago
  
  Not only will you need a team of DBAs caring for it, but you'll never be able to hire them.
  - hobs - 9 hours ago
    
    No organization I have seen prioritizes a DBA's requirements, concerns, or approach. They certainly don't pay them enough to deal with that bullshit, so I was out.
fidgetstick - 11 hours ago

Martin Fowler:
https://www.enterpriseintegrationpatterns.com/patterns/messa...
rawgabbit - 10 hours ago

The things that impact the most are locking/blocking, data duplication (ghosting due to race conditions), and poor performance. The best advice is to RTFM for your database; yes, it is a lot to digest, which is why DBAs exist. Most of these foot guns are due to poor architecture. You have to imagine multiple users/processes are literally trying to write to the same record at the same time; when you realize this, a single table with simple key-values is completely inadequate.
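
One common way to cope with two writers hitting the same record is optimistic locking with a version column. The sketch below is only an illustration using Python's built-in `sqlite3`; the `accounts` table, its columns, and the `update_balance` helper are hypothetical, and a real RDBMS offers other tools (row locks, isolation levels) that its documentation covers.

```python
import sqlite3


def update_balance(conn, account_id, new_balance, expected_version):
    """Write only if nobody else changed the row since we read it."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    # rowcount is 0 when the version no longer matches, i.e. we lost the race.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")
conn.commit()

# Two writers both read the row at version 0, then both try to write.
first_ok = update_balance(conn, 1, 150, expected_version=0)
second_ok = update_balance(conn, 1, 80, expected_version=0)
final_row = conn.execute(
    "SELECT balance, version FROM accounts WHERE id = 1"
).fetchone()
```

The second writer is rejected instead of silently overwriting the first, so the application can re-read the row and retry.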
sgarland - 10 hours ago

Pro: every team probably needs user information, so don’t duplicate it in weird ways with uncertain consistency.
Con: it’s sadly likely that no one on your staff knows a damn thing about how an RDBMS works, and is seemingly incapable of reading documentation, so you’re gonna run into footguns faster. To be fair, this will also happen with isolated DBs, and will then be much more effort to rein in.
brandmeyer - 10 hours ago

They are very similar to the pros and cons of having a monorepo. It encourages information sharing and cross-linkage between related teams. This is simultaneously its biggest pro and its biggest con.

mlrtime - an hour ago

Pagerduty: They haven't yet hit that point where PD doubles the prices for them. Or they don't have everyone on the platform, it will be their next Datadog (too expensive)

AIorNot - 3 hours ago

It's insane how many SaaS solutions are needed piecemeal to run a company these days - just listing everything out like that made it apparent

YetAnotherNick - 3 hours ago

And it's just infra, tech stack comes after it.

weedhopper - 13 hours ago

Great post. I even wouldn’t mind more details, especially about datadog, or as others pointed out, the kind of contradiction with aws support.

0xbadcafebee - 12 hours ago

Using GCP gives me the same feeling as vibe-coded source code. Technically works but deeply unsettling. Unless GCP is somehow saving you boatloads of cash, AWS is much better.

RDS is a very quick way to expand your bill, followed by EC2, followed by S3. RDS for production is great, but you should avoid the bizarre HN trope of "Postgres for everything" with RDS. It makes your database unnecessarily larger which expands your bill. Use it strategically and your cost will remain low while also being very stable and easy to manage. You may still end up DIYing backups. Aurora Serverless v2 is another useful way to reduce bill. If you want to do custom fancy SQL/host/volume things, RDS Custom may enable it.

I'm starting to think Elasticache is a code smell. I see teams adopt it when they literally don't know why they're using it. Similar to the "Postgres for everything" people, they're often wasteful, causing extra cost and introducing more complexity for no benefit. If you decide to use Elasticache, Valkey Serverless is the cheapest option.

Always use ECR in AWS. Even if you have some enterprise artifact manager with container support... run your prod container pulls with ECR. Do not enable container scanning, it just increases your bill, nobody ever looks at the scan results.

I no longer endorse using GitHub Actions except for non-business-critical stuff. I was bullish early on with their Actions ecosystem, but the whole thing is a mess now, from the UX to the docs to the features and stability. I use it for my OSS projects but that's it. Most managed CI/CD sucks. Use Drone.io for free if you're small, use WoodpeckerCI otherwise.

Buying an IP block is a complicated and fraught thing (it may not seem like it, but eventually it is). Buy reserved IPs from AWS, keep them as long as you want, you never have to deal with strange outages from an RIR not getting the correct contact updated in the correct amount of time or some foolishness.

He mentions K8s, and it really is useful, but as a staging and dev environment. For production you run into the risk of insane complexity exploding, and the constant death march of upgrades and compatibility issues from the 12 month EOL; I would not recommend even managed K8s for prod. But for staging/dev, it's fantastic. Give your devs their own namespace (or virtual cluster, ideally) and they can go hog wild deploying infrastructure and testing apps in a protected private environment. You can spin up and down things much easier than typical AWS infra (no need for terraform, just use Helm) with less risk, and with horizontal autoscaling that means it's easier to save money. Compare to the difficulty of least-privilege in AWS IAM to allow experiments; you're constantly risking blowing up real infra.

Helm is a perfectly acceptable way to quickly install K8s components, big libraries of apps out there on https://artifacthub.io/. A big advantage is its atomic rollouts which makes simple deploy/rollback a breeze. But ExternalSecrets is one of the most over-complicated annoying garbage projects I've ever dealt with. It's useful, but I will fight hard to avoid it in future. There are multiple ways to use it with arcane syntax, yet it actually lacks some useful functionality. I spent way too much time trying to get it to do some basic things, and troubleshooting it is difficult. Beware.

I don't see a lot of architectural advice, which is strange. You should start your startup out using all the AWS well-architected framework that could possibly apply to your current startup. That means things like 1) multiple AWS accounts (the more the better) with a management account & security account, 2) identity center SSO, no IAM users for humans, 3) reserved CIDRs for VPCs, 4) transit gateway between accounts, 5) hard-split between stage & prod, 6) openvpn or wireguard proxy on each VPC to get into private networks, 7) tagging and naming standards and everything you build gets the tags, 8) put in management account policies and cloudtrail to enforce limitations on all the accounts, to do things like add default protections and auditing. If you're thinking "well my startup doesn't need that" - only if your startup dies will you not need it, and it will be an absolute nightmare to do it later (ever changed the wheels on a moving bus before?). And if you plan on working for more than one startup in your life, doing it once early on means it's easier the second time. Finally if you think "well that will take too long!", we have AI now, just ask it to do the thing and it'll do it for you.
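
For point 3 above (reserved CIDRs for VPCs), one way to plan non-overlapping ranges up front is sketched below with Python's standard `ipaddress` module. The `10.0.0.0/8` supernet, the `/16` size, and the account names are assumptions for illustration only, and `reserve_vpc_cidrs` is a hypothetical helper, not an AWS API.

```python
import ipaddress


def reserve_vpc_cidrs(supernet: str, prefix_len: int, names: list[str]) -> dict:
    """Hand each named account/VPC its own non-overlapping block of the supernet."""
    network = ipaddress.ip_network(supernet)
    if prefix_len <= network.prefixlen:
        raise ValueError("prefix_len must be longer than the supernet prefix")
    available = 2 ** (prefix_len - network.prefixlen)
    if len(names) > available:
        raise ValueError("not enough blocks in the supernet for every name")
    # subnets() yields the blocks in address order; zip stops after the last name.
    return dict(zip(names, network.subnets(new_prefix=prefix_len)))


# Hypothetical account layout; adjust names and sizes to your own plan.
plan = reserve_vpc_cidrs("10.0.0.0/8", 16, ["management", "security", "stage", "prod"])
for name, cidr in plan.items():
    print(f"{name}: {cidr}")
```

Keeping a table like this from day one is what makes later peering or a transit gateway between accounts possible without renumbering.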

zbentley - 11 hours ago

> Do not enable container scanning, it just increases your bill, nobody ever looks at the scan results.
God I wish that were true. Unfortunately, ECR scanning is often cheaper and easier to start consuming than buying $giant_enterprise_scanner_du_jour, and plenty of people consider free/OSS scanners insufficient.
Stupid self inflicted problems to be sure, but far from “nobody uses ECR scanning”.

dangoodmanUT - 10 hours ago

> There are no great FaaS options for running GPU workloads, which is why we could never go fully FaaS.

modal.com???

gnarbarian - 8 hours ago

given enough time you may regret every single one of them.

ink_13 - 11 hours ago

(2024)

Just FYI article is two years old

gib444 - 6 hours ago

Infra guys doing DBA is a nightmare in my experience (usually clueless and it gets loved less than more sexy parts of infra). Devs too

Hire a DBA ASAP. They also need to rein in the laziness of all other developers when designing and interacting with the DB. The horrors a dev can create in the DB can take years to undo

nijave - 2 hours ago

I'm a little afraid to say it but LLMs are getting quite good at query optimization. They can also read slow query logs and use extensions like pg_stat_statements
Doesn't necessarily prevent a terrible schema but it's become a lot easier to fix abomination queries at least
phrotoma - 3 hours ago

As an infra person I couldn't agree more. Get an expert in there. DB's are their own universe of complexity and deserve dedicated attention.

piokoch - 5 hours ago

I've just looked at Appsmith out of curiosity, as the author endorsed this tool as an admin panel builder. I had to double check the name, as right now this is, surprise, surprise, an AI powered application builder...

I used to use Replit for educational purposes, to be able to create simple programs in any language and share them with others (teachers, students). That was really useful.

Now Replit is a frontend to some AI chat that is supposed to write software for me.

Is this jumping onto the AI bandwagon everywhere a new trend? Is this really needed? Is this really profitable?

hare2eternity - 2 hours ago

Just about anyone who aspires to raise capital in the current market is making themselves out to be AI. Give it a couple of years and we'll be onto the next craze. By that time I should have migrated my application off the blockchain into the metaverse.

themafia - 9 hours ago

> “This EC2 instance type running 24/7 at full load is way less expensive than a Lambda running”.

For the same amount of memory they should cost _nearly_ identical. Run the numbers. They're not significantly different services. Aside from this you do NOT pay for IPv4 when using Lambda, you do on EC2, and so Lambda is almost always less expensive.
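
If you want to run the numbers yourself, a small calculator like the one below may help. It is a rough sketch under stated assumptions: every price is an input you take from the current pricing pages for your region (none are hard-coded here), the 730-hour month is a common approximation, and the function names are made up for illustration. It ignores free tiers, data transfer, storage, and load balancers.

```python
HOURS_PER_MONTH = 730  # approximation often used for monthly estimates


def lambda_monthly_cost(price_per_gb_second, memory_gb, busy_seconds,
                        price_per_million_requests=0.0, requests=0):
    """Lambda-style billing: memory size times time actually spent running, plus request fees."""
    compute = price_per_gb_second * memory_gb * busy_seconds
    request_fees = price_per_million_requests * requests / 1_000_000
    return compute + request_fees


def ec2_monthly_cost(instance_price_per_hour, ipv4_price_per_hour=0.0,
                     hours=HOURS_PER_MONTH):
    """EC2-style billing: pay for every hour the instance (and its public IPv4) exists."""
    return (instance_price_per_hour + ipv4_price_per_hour) * hours
```

For a 24/7 full-load comparison, set `busy_seconds` to `HOURS_PER_MONTH * 3600` and `memory_gb` to the instance's memory; for bursty workloads, lower `busy_seconds` to the time the function actually runs.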

nijave - 2 hours ago

I'm curious how that plays out when you factor in other infrastructure components like DB and load balancers.
On Lambda, load balancing is handled out of the box but you may need to introduce things like connection poolers for the DB you could have gotten away without on EC2
Think it also depends if you're CPU or memory constrained. Lambda seemed more expensive for CPU heavy workloads since you're stuck with certain CPU:mem ratios and there's more flexibility on EC2 instance types

MaXtreeM - 6 hours ago

Previous discussion (626 comments): https://news.ycombinator.com/item?id=39313623

Adexintart - 2 hours ago

[flagged]

ohyoutravel - 2 hours ago

Why make these slop bots? This account is 7 minutes old and has already posted paragraph-long AI-generated comments in several places.