Good system design
seangoedecke.com | 934 points by dondraper36 5 days ago
> I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design.
For any job-hunters, it's important you forget this during interviews.
In the past I've made the mistake of trying to convey this in system design interviews.
Some hypothetical startup app
> Interviewer: "Well what about backpressure?"
>"That's not really worth considering for this amount of QPS"
> Interviewer: "Why wouldn't you use a queue here instead of a cron job?"
> "I don't think it's necessary for what this app is, but here's the tradeoffs."
> Interviewer: "How would you choose between sql and nosql db?"
> "Doesn't matter much. Whatever the team has most expertise in"
These are not the answers they're looking for. You want to fill the whiteboard with boxes and arrows until it looks like you've got Kubernetes managing your Kubernetes.
(For context, I've conducted hundreds of system design interviews and trained a dozen other people on how to do them at my company. Other interviewers may do things differently or care about other things, but I think what I'm saying here isn't too far off normal.)
I think three things about what you're saying:
1. The answers you're giving don't provide a lot of signal (the queue one being the exception). The question that's implicitly being asked is not just what you would choose, but why you would choose it. What factors would drive you to a particular decision? What are you thinking about when you provide an answer? You're not really verbalizing your considerations here.
A good interviewer will pry at you to get the signal they need to make a decision. So if you say that back-pressure isn't worth worrying about here, they'll ask you when it would be, and what you'd do in that situation. But not all interviewers are good interviewers, and sometimes they'll just say "I wasn't able to get much information out of the candidate" and the absence of a yes is a no. As an interviewee, you want to make the interviewer's job easy, not hard.
2. Even if the interviewer is good and does pry the information out of you, they're probably going to write down something like "the candidate was able to explain sensibly why they'd choose a particular technology, but it took a lot of prodding and prying to get the information out of them -- communications are a negative." As an interviewee, you want to communicate all the information your interviewer is looking for proactively, not grudgingly and reluctantly. (This is also true when you're not interviewing.)
3. I pretty much just disagree on that SQL/NoSQL answer. Team expertise is one factor, but those technologies have significant differences; depending on what you need to do, one of them might be way better than the other for a particular scenario. Your answer there is just going to get dinged for indicating that you don't have experience in enough scenarios to recognize this.
+1 on the signal. A great candidate won't need you to pry further than asking about backpressure, they'll explain WHY it's not necessary for the qps, at what qps it would start becoming necessary, and how they would build it into their design down the line if the service takes off.
One of the things I tell people preparing for system design interviews is the more senior you are, the more you need to drive the interview yourself, knowing when to go deep, what to go deep on, and how to give the most signal to the interviewer.
As the interviewee I'd use such a question to demonstrate my knowledge of the topic. For this question I'd point out that thread-per-CPU w/ async I/O designs w/ maximum live connections and clever connection pool acceptance & eviction policies, and limited buffering, together intrinsically limit oversubscription, but I would still talk extensively about health monitoring and external circuit breaking, as well as the use of 429/whatever and/or flow control (where available) at the protocol level to express backpressure. I would then use this to harp on the evils of thread-per-client designs, and also why they happen, as well as the various alternatives.
Make the interviewer tell you when they've had enough and change topics.
I beg you to write an article expanding on these points. I'll pay to read this article.
Really!
I've written about this in my comments here...
Here's a brief summary:
- typical thread-per-client programming is terribly wasteful because it needs large stacks that must be able to grow (within reason), and this leads programmers to smear client state all over the stack, which then means that the memory and _cache_ footprint of per-client state is huge even though the state is highly compressible, and this is what reduces efficiency when you want to C10K (serve 10,000 clients), and this is what led to C10K techniques in the 90s
- you can most highly compress said per-client program state by using continuation passing style (CPS) async I/O programming, either hand-coded or using modern async functions in languages that have them -- this approach tends to incentivize the programmer to compress program state into a much smaller structure than an execution stack, which therefore greatly reduces the per-client memory and _cache_ footprint of the program
Note that reducing the memory and cache footprint of a program also reduces its memory bandwidth footprint, and increases cache locality and reduces latency. This means you can serve more clients with the same hardware. THIS is the point of C10K techniques.
All other techniques like fibers and green threads sit on the spectrum from hand-coded CPS async I/O programming to thread-per-client sequential programming. You get to pick how efficient your code is going to be.
Now, when you apply C10K techniques you get to have one thread-per-CPU -- look ma'! no context switches -- which also improves latency and efficiency, naturally. But there's another neat thing to thread-per-CPU: it becomes easier to manage the overall health of the service, at least per-CPU, because you can now manage all your connected clients, and so you can have eviction policies for connections, and admittance policies too. In particular you can set a maximum number of connections, which means that you can set maxima that only slightly oversubscribe the hardware's capabilities.
Otherwise [and even if you do the thread-per-CPU thing, though it's less important to have circuit breakers if you do] you must have some way to measure the health of your service, and you need to monitor it, and you need your monitor to be able to "break the circuit" by telling your service to start rejecting new work. This is where HTTP status 429 and similar come into play -- it's just a way to express backpressure to clients, though flow control will also do, if you have that available to exercise. You'll still need to be able to monitor load, latencies, and throughput for thread-per-CPU services, naturally, so you know when you need to add HW. And of course you'll want to build services you can scale horizontally as much as possible so that adding hardware is easy, though too you need to be able to find and stop pathological clients (dealing with DoS and DDoS almost always requires components external to your services).
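To make the connection-cap-plus-429 bit concrete, here's a minimal sketch (illustrative Python asyncio; the cap, the toy request parsing, and the port are all made up for the example):

    import asyncio

    MAX_LIVE_CONNECTIONS = 1000   # sized to only slightly oversubscribe the hardware (illustrative)
    live = 0

    async def handle(reader, writer):
        global live
        if live >= MAX_LIVE_CONNECTIONS:
            # Admittance policy: shed load early and express backpressure to the client.
            writer.write(b"HTTP/1.1 429 Too Many Requests\r\nRetry-After: 1\r\n\r\n")
            await writer.drain()
            writer.close()
            return
        live += 1
        try:
            await reader.readline()   # toy request parsing
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            await writer.drain()
        finally:
            live -= 1
            writer.close()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8080)
        async with server:
            await server.serve_forever()

    asyncio.run(main())

In a real thread-per-CPU deployment you'd run one such loop per core and keep the circuit breaker external, but the admittance/backpressure shape is the same.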
Make sure all middleware can respond appropriately to backpressure, including having their own circuit breakers, and you have a pretty resilient system -- one that under extreme pressure will be able to shed load and continue to function to some degree.
You'll need to be able to express client priorities (so that you can keep servicing part of your load) and quotas (so that pathological high-priority clients don't DoS you).
There's much more to say, naturally.
BTW, I checked today and LLMs seem to know all this stuff, and more than that they'll be able to point you to frameworks, blogs, and lots of other docs. That said, if you don't prompt them to tell you about the thread-per-CPU stuff, they won't.
Keep in mind that C10K techniques are expensive in terms of developer time, especially for junior developers.
There’s also a systems-level rationale to this. Without good isolation, you’ll get a feedback loop: threads start to step on each other’s toes. This leads to slower response times. Which, at a given request pressure, leads to more parallel threads. Which slows them down even more. If there’s a brief peak in pressure that pushes response times past a critical point, such a system will never recover, and you’ll get a server furiously computing without an apparent reason only to behave normally after a restart.
Yes, and thus circuit breakers. By sizing offered capacity to some factor of actual capacity you can limit the effects of too much demand to causing backpressure naturally (rejecting requests) instead of timeouts and retries. This then allows you some level of access -- such as to your health and diagnostics end-points, because CPU usage doesn't become so high that you can't even run those.
Most systems cap the size of their thread pool and put excess requests into a queue.
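A rough sketch of that cap-plus-queue pattern (illustrative Python; the worker and queue sizes are invented, and the point is that the queue itself is bounded so it can't become the runaway buffer):

    import queue
    import threading

    MAX_WORKERS = 8      # cap on concurrent work (illustrative)
    MAX_QUEUED = 100     # bounded queue: beyond this we shed load instead of melting down

    work_queue = queue.Queue(maxsize=MAX_QUEUED)

    def worker():
        while True:
            job = work_queue.get()
            try:
                job()                    # run the request handler
            finally:
                work_queue.task_done()

    for _ in range(MAX_WORKERS):
        threading.Thread(target=worker, daemon=True).start()

    def submit(job):
        """Admit a request, or reject immediately so the caller can return 429 / ask for a retry."""
        try:
            work_queue.put_nowait(job)
            return True
        except queue.Full:
            return False                 # backpressure: excess demand is rejected, not buffered forever

    submit(lambda: print("handled one request"))
    work_queue.join()                    # wait for in-flight work before exiting (demo only)

When submit() returns False the caller can surface a 429 or retry later, instead of letting latency climb until the whole system tips over as described above.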
Good summary of the theory, but the weird thing is that every time I’ve rewritten code to use async the total throughput went down by about 10%… which is about what I estimate the overhead introduced by the compiler-generated async state machinery to be.
I’m yet to see a convincing set of A/B comparisons from a modern language. My experiences don’t line up with the conventional wisdom!
That could be because you're still smearing state on the stack? With async functions one can do that, and so you still have stacks/fibers/threads, and so you've not gained much.
With a CPS approach you really don't have multiple stacks.
Oh, and relatedly the functional core, imperative shell (FCIS) concept comes in here. The imperative shell is the async I/O event loop / executor. Everything else is functional state transitions that possibly request I/O, and if you represent I/O requests as return values to be executed by the executor, then you can have those state transitions be functional. The functional state transition can use as much stack as it wants, but when it's done the stack is gone -- no stack use between state transitions.
Now naturally you don't want state transitions to have unbounded CPU time, but for some applications it might have to be the case that you have to allow it, in which case you have problems (gaaah, thread cancellation is such a pain!).
The point of FCIS is to make it so it's trivial to test the state transitions because there is nothing to mock except one input, one state of the world, and check the output against what's expected. The "imperative shell" can also be tested with a very simple "application" and setup to show that it works w/o having to test the whole enchilada with complex mockup setups.
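A toy sketch of that shape (illustrative Python; the effect types and handler names are invented, and shell() is meant to be run by something like asyncio.start_server):

    from dataclasses import dataclass

    # Effects: descriptions of I/O the functional core wants done, returned as plain values.
    @dataclass
    class SendReply:
        data: bytes

    @dataclass
    class CloseConnection:
        pass

    def on_message(state, message):
        """Functional core: (old state, input) -> (new state, requested effects). No I/O here."""
        count = state.get("messages", 0) + 1
        new_state = {**state, "messages": count}
        if message == b"quit":
            return new_state, [CloseConnection()]
        return new_state, [SendReply(b"ack %d\n" % count)]

    async def shell(reader, writer):
        """Imperative shell: the only place that touches the sockets."""
        state = {}
        while True:
            line = await reader.readline()
            if not line:                 # peer closed the connection
                writer.close()
                return
            state, effects = on_message(state, line.strip())
            for effect in effects:
                if isinstance(effect, SendReply):
                    writer.write(effect.data)
                    await writer.drain()
                elif isinstance(effect, CloseConnection):
                    writer.close()
                    return

on_message can be unit-tested with nothing but a dict and a bytes value, which is the whole point.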
What you're describing is how the interview process _is_ disconnected from the actual needs, and how it's good to literally "play the game" to get in.
But on the other side, that kind of interview process is itself also a signal candidates might take to avoid playing the game, knowing that most (not all, of course) companies probing for the wrong signals during the interview process are indicative of how they do function as a whole.
(been in both types of companies, and in both sides of the table)
OP has stated that the system design interview exists as a way to performatively answer a set of questions that don't relate to actually being a good system designer. In your response, you have claimed that their proposed truthful answers don't give a lot of 'signal' and that you would prefer candidates to engage with the performative nature of the process. This is true - but is not an argument against the OP's claim, which is that reciting a set of facts about system design in an interview != being a good system designer.
All that “signal” nonsense can be parroted by both an LLM and someone who read “how to pass system interviews”. Yea, great “signal”.
Not really, not in live, oral interviews.
Though I once had a case of the person we thought we were hiring and the person we got being different people. The fix for that is to always have one final in-person interview.
This goes back to "interviews go both ways". All those answers you gave are very reasonable and if I was your interviewer I'd pass you with flying colors. On the other hand if you're interviewing at a place that doesn't pass you with flying colors for those responses, that really says more about them than it does about you and may not be a great place to work.
But to your point, many times one interviews for a job they don't really have the luxury of getting rejections and need to land somewhere fast so they can keep paying the mortgage. So while yes interviewing is a two way street, there's still quite a bit of calibration to make sure you land on the other person's side of the street so to speak.
If I was your interviewer, I would: respect your answers a lot, not be able to check off anything on my rubric, try to explain this in the debrief, get told we have to stick to the rubric to counter bias, and then watch while they pass on you for someone who decided to play architecture jenga instead. I would potentially even consider emailing you to apologize later, then not do it because I'd probably get in trouble for exposing us to liability or something because apologizing can be construed as admission of guilt.
If a candidate doesn’t ask clarifying questions that lead them to an understanding of QPS, storage requirements, and throughput considerations, that’s a mark against.
At that point, if you want to see them design a distributed system with all the bells and whistles, you should stop them, tell them the kind of traffic they need to handle, then let them go again.
If they persist in designing a system that cannot handle the specified load, they have probably failed the interview.
The problem with this is people seem to have mismatched understandings of what a single system can handle. e.g. my 8 year old quad core i5 desktop with a bit of batching optimization can handle 5 digit requests per second with 15 ms p99 with some nontrivial application logic doing several joins. I don't think I've tried that same benchmark on a modern minipc, but I expect it should be similar. That's well above what most companies will ever need to handle. Visa advertises they can process ~70k tps worldwide.
Last time I interviewed I was asked about designing a system to handle 10s of thousands of events per minute, and if you thought about the problem a little you'd realize most of them didn't require real work to be done. I answered something along the lines of "you don't need to do anything special. Just normal postgres/mysql usage can handle more than that on a laptop". After I got hired I learned the rubric had some expected answers about queues (e.g. Kafka) in it. No idea why still.
Because web devs are so used to terrible design, poorly-optimized DB schemas, and networked storage latency that they have no idea what a single server (or indeed, a humdrum desktop) is capable of.
Like when I inform teams complaining of “slow queries” that the DB is executing them in sub-msec time. No idea what the rest of your stack is doing, but good luck with figuring that out - it ain’t me.
“Prove that you can apply solutions to yesterday’s problems today.” is a good strategy except in industries where today is exponentially different to yesterday.
I’m also going to need a dollar value on your data and a list of consequences. We will spend our allotted time together in Excel.
I’ve interviewed dozens of people and while I rarely do system design questions and our process isn’t nearly as check-all-the-boxes, it’s funny how accurate your comment still is. Near the later stages especially, politics starts coming in.
Exactly, it would only work if you have enough sway with your boss and the willingness to take responsibility for the hire
If I were the interviewer, I'd try to adjust the problem statement with some hypotheticals to tease out their depth of knowledge:
> "That's not really worth considering for this amount of QPS"
"What if Michael Jackson dies and your (search|news|celebrity gossip) service gets a spike in traffic way beyond the design parameters? How would you anticipate and mitigate such an event?"
(Extra points if the answer is not necessarily backpressure but they start talking about DDoS mitigation, outlier detection, caching or serving static results from extremely-common queries, spinning up new capacity to adjust to traffic spikes, blackholing traffic to protect the overall service, etc.)
> Interviewer: "Why wouldn't you use a queue here instead of a cron job?" "I don't think it's necessary for what this app is, but here's the tradeoffs."
"What if you have a subset of customers that demand faster responses than a cron job can provide?"
(And then that can become a discussion about splitting off traffic based on requirements, whether it's even worth adding the logic to split traffic vs. just using a queue for everyone, perhaps making direct API requests without either a queue or cron job for requests from just those customers, relying on the fact that they are not numerous or these requests are infrequent to trade capacity for latency, etc.)
> How would you choose between sql and nosql db?"
I would've expected the candidate to at least be able to talk about indexing, tradeoffs of joining in the DB vs. in the application, schema migrations and upgrades, creating separation between data-at-rest vs. data-in-flight, etc. If they can't do that and just handwave away as "whatever the team is most comfortable with", that's a legit hole in their knowledge. Usually you ask system design interviews of senior candidates that will be deciding on architecture and, if not hiring out the team directly, providing input to senior managers who will be hiring, so you can swap out the team nearly as easily as swapping out the architecture.
Exactly this. I don’t want someone who will design complex, bloated systems, but I DO want them to be able to articulate tradeoffs and reasons why various components might be useful.
>I would've expected the candidate to at least be able to talk about indexing, tradeoffs of joining in the DB vs. in the application, schema migrations and upgrades, creating separation between data-at-rest vs. data-in-flight, etc.
The problem is that many of these trade-offs only applied to older databases. The more relevant axis is about how distributed the db is, the replication type etc.
> that really says more about them than it does about you and may not be a great place to work.
If a really good "tech" engineer ruled out all the places that are bad at interviewing, they would probably be unemployed.
You have to look past bad interviewing practice, to some degree.
> there's still quite a bit of calibration to make sure you land on the other person's side of the street so to speak.
Exactly. But if they try to Leetcode you, you have to decide whether you have any self-respect at all, or you're all just playing house together.
This is awful advice. Simple and elegant design does not start with dismissing potential problems.
Those questions are all prompts to have a discussion in lieu of tech trivia hour. Those responses do not demonstrate wisdom, they reveal a lack of maturity. It's not the interviewers fault you refuse to be interviewed.
I agree, the responses give the vibe of "your questions are dumb and I'm too smart to waste the effort to engage with them." If you don't want the job, then don't interview!
Yes, and this is exactly why LinkedIn-driven development exists in the first place. Listing a million technologies looks much more impressive on paper to recruiters than describing how you managed to only use a modular monolith and a single Postgres instance to make everything work.
As well as the “two-way street” point made in a sibling comment, I feel like a good interviewer would say “this is great, I would keep it simple too, but I am testing your knowledge of $thing right now.” If the person won’t stop talking about the wrong thing, that’s a bad sign of course.
Do you _want_ to work in these places? In my experience, if they expect you to run kube using kube in the interview, that's exactly what they do in their systems as well.
These are the places that actually pay well.
There's another reason for that. Deep in my heart, I would love to be part of a team that works on truly data-intensive applications (as Martin Kleppmann would call them) where all the complexity is justified.
For example, I am more of the "All you need is Postgres" kind of software engineer. But reading all those fancy blog posts on how some team at Discord works with 1 trillion messages with Cassandra and ScyllaDB makes me envious.
Also, it seems that to be hired by such employers you need to prove that you already have such experience, which is a bit of a catch-22 situation.
I feel like the phrase "all you need is Postgres" has the (often unspoken) continuation of "until you actually get to a trillion messages".
In other words, the developers you're envious of didn't start with Cassandra and ScyllaDB, they started with the problem of too many messages. That's not an architectural choice, that's product success.
Absolutely. To put it differently, unfortunately not everyone has a chance to be part of a product's organic evolution from "all we need is Postgres" to "holy crap, we're a success, what is Cassandra by the way?"
As a data point, I've been at two data-intensive startups where they eventually needed to pull some of their table-like data out of postgres, and for both that was past a $100MM valuation.
This varies by domain of course, but non-postgres solutions are generally built for very specific problems – they're worse than postgres at everything except one or two cases.
Only places that are making good money can afford to have overengineering.
Overengineering is more prevalent the more money a company makes, and companies that overengineer will pay good money to keep the overengineering working.
Something about my old CTO and VP of Eng I respected is they were still technical enough to call out this kind of thing. For as big as that company was they really held down complexity and overengineering to a real minimum.
Unfortunately the rest of the executive has leaned on them so hard about AI boosting productivity they aren’t able to avoid that becoming a mess
It is a shame that so many companies try to scale by just hiring a lot of people; the more people you have on a single project, the more overengineering you will end up with.
Some of it is a consequence of managing so many individual contributors. I still believe a lot of companies use microservices more as a way to scale to more teams than for scalability/reliability/observability.
Some of it is just people coming up with clever solutions (and leaving after the fact) and a lot from resume-driven development.
> These are not the answers they're looking for.
These ARE the answers we are looking for. As the system design interviewer (I’ve done hundreds) I want you to start with these answers, then we can layer on complexity if you’ve solved the problem and there’s time left to go into navel gazing mode.
Seeing the panic slowly build in mid-level engineers’ eyes as it dawns on them that not every problem can be solved by caching is pretty fun too. “Ok cool you’ve cached it there, now how do you fill the cache without running into the same performance issue?”
> I want you to start with these answers then we can layer on complexity if you’ve solved the problem and there’s time left to go into navel gazing mode
Exactly. Part of the interview is explaining when and why these techniques are necessary as part of demonstrating your understanding.
If the candidate gives non-answers like “I don’t think it matters because you’re a startup” or “I’d just use whatever database I’m comfortable with” that’s not demonstrating knowledge at all. That’s dismissing the question in a way that leaves the interviewer thinking you don’t have that knowledge, or you don’t take their problems seriously enough to put thought into them. There is a type of candidate who applies to startups because they think nothing matters and they can YOLO anything together for a few years before moving on to the next job, and those are just as bad as the super over-engineering candidates.
The interview is your chance to show you know the topics and when to apply them, not the time to argue that the startup shouldn’t care about such matters.
> The interview is your chance to show you know the topics and when to apply them, not the time to argue that the startup shouldn’t care about such matters.
A good way to answer these, I think, is some version of ”We probably won’t run into these issues at the scale we’re talking about, but when we run into A, B, C problems, we can try X, Y, Z solutions.”
This shows that you’re making a conscious tradeoff and know when the more complex solutions apply. Extra points if you can explain specifically how you’ll put measures in place to know when A, B, C happened and how you would engineer the system such that adding X, Y, Z is easy.
Also it looks amazing if you’re aware that vertical scaling can buy you a lot of time for comparably little money these days. Servers get up to 128 CPUs with 64TB of RAM on one machine :)
Right, and you might be small in $year but presumably you expect to grow and they don’t want to replace the team because they can’t think how to operate in any other circumstances.
> Part of the interview is explaining when and why these techniques are necessary as part of demonstrating your understanding.
The slightly altered "explain when and why these techniques are *not* necessary" is much less appreciated.
> I want you to start with these answers then we can layer on complexity if you’ve solved the problem and there’s time left to go into navel gazing mode.
Do you tell people this explicitly? If so, good on you; if not, please start! I think one of the biggest problems with interviews these days is misaligned expectations, particularly interviewees coming in assuming that what's desired is immediate evidence that they're so experienced in solving FAANG-scale problems that it's their default mode.
I believe even at FAANG-like companies, only a lucky minority is involved at that level of scale. Most developers just use the available infrastructure and tools without working on the creation of S3 or BigTable.
This famous blog post [0] suggests that the default behaviour at Google at least is for everything to deal with massive scale. Doesn't mean everyone is involved in creating massive-scale infrastructure like S3 or BigTable, but it does mean using that kind of infrastructure from the start
[0] https://www.lesswrong.com/posts/koGbEwgbfst2wCbzG/i-don-t-kn...
> Do you tell people this explicitly?
Yes and no. I give them rough scale numbers to design for. Part of the interview is knowing why I’m telling you this.
At the level where this matters, the skill to figure it out from context is important. You aren’t the guy converting spec to code. You’re the spec maker.
I agree, but I think my point is that the interview context and expectations can differ radically from the role context, depending on the interviewer. If the expectation of the interviewer is that the interviewee should be asking questions to determine scale needs, then they should be explicit about that. For all the interviewee knows, you're going to ding them and ultimately fail them for asking too many questions and not exhibiting knowledge and experience.
> For all the interviewee knows, you're going to ding them and ultimately fail them for asking too many questions and not exhibiting knowledge and experience.
I start the interview with “I am here in the role of PM and co-engineer so you can bounce ideas off of me and ask any questions”
Stakeholders won’t start their asks with “Please ask me questions to make sure you’re building the right thing”. Asking clarifying questions is a baseline expectation of the role
This also happens because plenty of candidates learn the buzzwords and patterns without understanding the trade-offs and nuances. With a competent enough interviewer, the shallowness of knowledge can be revealed immediately.
Identifying candidates who repeat buzzwords without understanding tradeoffs is easy. It’s part of the questioning process to understand the tradeoffs.
The problem with the comment above is that it’s not discussing tradeoffs at all. It’s just jumping to conclusions and dodging any discussion of tradeoffs.
If you answer questions like that, it’s impossible to tell if the candidate is being wise or if they’re simply BSing their way around the topic and pretending to be smart about it, because both types of candidates sound the same.
It’s easy to avoid this problem by answering questions as asked and mentioning tradeoffs. Trying to dismiss questions never works in your favor.
Yes, I would probably phrase it like this. "Under the current load, I would go super simple and use X, which can work fine long enough until it doesn't. And then we can think about horizontal scaling and use Y and Z". Then proceed with a deeper discussion of Y and Z, probably.
After all, interviewing and understanding what your interviewer expects to hear is also a valuable skill (same as with your boss or client).
Even better would be to clarify: under the current load, and if the reasonably expected future load is similar, I would use X for Y reasons.
Sometimes the “trick” is that today's load is not tomorrow's.
You're equating simplicity of the design with simplicity of the problem.
It's good not to over-engineer; over-engineering can be a cause of unneeded complexity. But when complexity is warranted, the ability to solve for it simply is also needed.
More importantly though, you haven't explained or rationalized why?
It's not needed for this QPS? Oh ya? Why not? What's your magic threshold? When would it be needed? How do you plan for the team to know that time is approaching? If it's needed later how would you retrofit it? Is that going to be a simple addition? How do you know the max QPS won't be too high and that traffic won't be spiky? What if a surprise incident occurred that caused the system to overload, how would your design, without backpressure, handle that, how would you mitigate and recover?
In system design there's no real right answer, as an interviewer you're looking for the candidate to demonstrate their ability to identify the point of concerns, reason through the possibilities, explain their decisions and trade offs, and so on.
I recently had an interview like this. Felt like half the answers I gave were of the form, “You can do scaling/sharding/partitioning thing X here, but once again, for an internal app I’d try really hard to avoid doing any of that”. If you’re interviewing with capable, experienced developers, they’ll appreciate that answer (at least, I got the offer on this one!)
Louder for the back.
It’s like people crave complexity because it makes them, indispensable? Like if you’re the only one who knows how the billing reconciliation service works, they couldn’t possibly fire you?
They will.
Being pragmatic is something I look for in engineers. So long as they understand where to draw the line (and use a queue instead of cron). However that’s usually several years away at this point and them being able to say “You don’t need that, all you need is…” is welcome. Then again, that’s probably why I got fired. :shrug:
I believe the reason is far more mundane: Complex systems are more interesting, with all the shiny knobs and levers and mysterious thingamabobs. Developers have a tendency to get nerd-sniped by interesting problems, and picking overly complex solutions to solve them at an abstract level scratches that itch very succinctly. In my experience, senior engineers learn to control this urge, and staff engineers can accurately decide when to break the rule and the complexity is warranted.
I’ve been in software for 20 years and it’s the first time I hear “back pressure”. Am I too old already?
> I’ve been in software for 20 years and it’s the first time I hear “back pressure”. Am I too old already?
I first wrote code 50 years ago (I am 63yo) so yes, imo we are too old, but ...
It is worth noting that systems concepts/techniques often have analogues aka different names and histories in different fields and subfields.
If I were to "explain" back pressure to an ordinary person I might model my analogy to the logic of this ~classic joke:
Bob: Let's go to Trendio(TM) for dinner tonight!? Carol: Oh, nobody goes there anymore, it's too crowded!
Also, often a modern take-this-for-granted concept may be seen as an outgrowth of previous problems or solutions.
For example back pressure is conceptually adjacent to the clever~hack/design of random backoff in Ethernet.
Or if talking to a math geek or traffic planner you might relate it to ~modern understanding of congestion including oddities like possibly removing roads/routes to ~paradoxically improve traffic flow.
We are deep in the Information Age barreling towards Singularities, so none of us, young or old, see and understand but a tiny fraction of where we've been, are, or might be going.
Cue Calvin & Hobbes cartoon of us racing downhill in a fragile box.
Perhaps, as others have essentially suggested, merging your mind with an ~AI will help (albeit temporarily, imo). I prefer to think of us/greybeards as potentially Wise, yet, paradoxically, clueless.
Beginner's Mind, with likely no time/future for Mastery, is still potentially pleasant, and I would argue useful for Debugging.
Obviously this modern AI tsunami is phase shifting us all into debug~mode anyway, eh?
Backpressure occurs at many levels, even down to a single machine doing something. If you ever have a producer and a consumer interacting and the consumer can’t consume as fast as the producer can produce, you need some way to have the producer pause or slow down until the consumer catches up. That’s back pressure.
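In code the smallest version of this is just a bounded queue between the two (illustrative Python; the sizes and the sleep only exist to make the effect visible):

    import queue
    import threading
    import time

    buf = queue.Queue(maxsize=10)   # the bound is what creates the backpressure

    def consumer():
        while True:
            item = buf.get()
            time.sleep(0.1)         # the consumer is slower than the producer
            print("consumed", item)
            buf.task_done()

    threading.Thread(target=consumer, daemon=True).start()

    for i in range(100):
        buf.put(i)                  # blocks once the queue is full, slowing the producer down
        print("produced", i)

    buf.join()                      # wait for the consumer to drain the rest

Once the queue is full, put() blocks and the producer is forced to run at the consumer's pace -- that's the backpressure.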
> it’s the first time I hear “back pressure”. Am I too old already?
It's the opposite, as you get older you will feel this more and more.
It's a sign that you didn't get into the "let's distribute every problem" rabbit hole. I don't think it correlates with age.
But keep the concept in your mind in case you have to distribute some problem. It's a central one.
You've just never played Factorio.
I have never played Factorio nor knew about it. It seems to be a very good game, thanks for the recommendation!
Unfortunately, it's too good. At least you'll learn all about backpressure in the days you spend lost to the world!
Yes
(but worry ye not; just like someone said of another term, "Dependency Injection" is a 25-dollar term for a 5-cent concept, and something similar applies to this term.)
Here’s a basic example https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/....
Services, systems, and/or databases eventually provide back pressure when they fail or get overloaded. The idea is to design in back pressure to let the system degrade gracefully rather than fail chaotically.
Somewhat surprising, but if you never dealt with scaling issues of a certain nature it may never have come up.
Though you might be familiar with other terms that effectively mean the same thing, like counter pressure
> > Interviewer: "Well what about backpressure?"
> > "That's not really worth considering for this amount of QPS"
There is a good way and a bad way to communicate this in interviews.
If an interviewer is asking about back pressure, they’re prompting you to demonstrate your knowledge of back pressure and how and when it would be applied. Treating it as an opening to debate the validity of the question feels like dodging the question or attempting to be contrarian. Explaining when and where you would choose to add back pressure would be good, but then you should go on to answer the question.
This question hits close to home for me because I was once working at a small startup that was dealing with a unique problem where back pressure really was the correct way to manage one of our problems, but we had a number of candidates do exactly what you did: Scoff at the idea that such a topic would be relevant at a startup.
If we’ve been dealing with a problem for months and a candidate comes in and confidently tells us that problem isn’t something we would experience and dismisses our question, that’s not a positive signal.
> > Interviewer: "How would you choose between sql and nosql db?"
> > "Doesn't matter much. Whatever the team has most expertise in"
This is basically a softball question. Again, if you provide a non-answer or try to dismiss the question it feels like you’re either dodging the topic or trying to be contrarian. It’s also a warning sign to the interviewer that you might gravitate toward what’s easy for you instead of right for the project.
This one also resonates with me because I spent years of my life making MongoDB do things that would have been trivial if earlier developers had used something like SQLite instead. The reason they chose MongoDB? Because the team was familiar with it. It was hell to be locked into years of legacy code built around the wrong tool for the job because some early employees thought it didn’t matter “because startup”
As an interviewer, let me give some advice: If an interviewer asks a question, you should answer the question. Anything that feels like changing the subject, dodging the question, or arguing the merits of the question feels like the candidate either doesn’t understand the topic or wants to waste time by debating the question.
It can be very valuable to explain when and why a topic would become necessary, right before you explain it. Instead of “this application has low QPS and therefore I will not answer your question” (not literally what you said, but how it comes across) you could instead explain how the need for back pressure could be avoided first by scaling servers appropriately and then go on to answer the question that was asked.
Re: SQL vs NoSQL my take is that one should always start with SQL and get good at SQL, then if and when you ever find yourself with a need to scale that you can't meet in any way other than to use a NoSQL, then switch to NoSQL. Nine times out of ten you'll never need to switch.
Not only theory crafting during interviews but a lot of real life design is driven by what's known as resume driven development. The worst part - some of that is later presented at large conferences as successful and go-to solutions.
One time I was working in a body-leasing company and our team was hired by a bigco for an internal project. Two months earlier an internal employee had been tasked to research the project and develop a prototype. When we started, all the major set pieces were written in stone. A month later said employee left. When we later checked the job listing he likely applied to, our tech stack mirrored it to the letter. He got free training, a resume and a new job. We were stuck with those decisions for 3 years.
Another time a local branch of another bigco was trying to carve out a major piece of the internal cake. A head-of was hired, a team was quickly ramped up and they started cooking their foothold. Then a series of major power shifts happened a couple levels above our pay grade and another branch came out with a competing strategy. We had a 2-day-long internal brainstorm involving 50 people to come up with arguments and strategies for how to defend our approach. We bet on blue, they were selling red. Lives were at stake. And many truly believed that blue was the way to go, and red was a recipe for disaster. Two days later we had a rock-solid presentation trashing the red approach. But of course most of these decisions are not made by nerds and middle mgmt, so eventually the company placed its bet on red and the whole dept became redundant. No one likes to lose their jobs, so our blue head-of quickly turned his cloak and the team became an outsourcing provider for the winning team. What makes this story particularly funny is that the head-of immediately started a campaign of conference presentations where he swore that all his life he had believed that red was the future that would eventually trump blue, and that any competition still using blue is destined to fail in the near future.
I think the right tack, if you're going to dismiss something as needlessly complex, is to call out the circumstances that would make it necessary and then describe what you'd do under those conditions.
"Backpressure? I don't think you'll have enough traffic to make backpressure necessary. The mode of failure here is that you run out of queue space and start dropping messages, and it's not a big deal if some messages get dropped here. But if we do decide that dropped messages are causing problems, and if it starts becoming a regular occurrence (we'll set up observability), here's how the producer can poll the queue size and return an error to the user under heavy load.
You don't need to entirely forget this. I've made a habit of regularly seeking out job opportunities and interviewing even when I'm entirely happy with my job, which is to say I've done a ton of these kinds of interviews (on both sides of the table).
Unless the initial question requirements are insane (build Twitter at Twitter scale), I start with the smallest, dumbest thing that will work. Usually that's a single machine/VM talking to a database (or even just SQLite!). Compute and storage are so fast these days that you could comfortably run your fledgling service on a Raspberry Pi, even serving three or four-digit QPS depending on the workload.
Of course, you still have to "play the game" in the interview, so make sure to be clear about how you'd change this as requirements changed (higher QPS, more features, etc)
Tools that reduce the barrier to entry to creating things make it easier to solve problems with less scale to pay for the overhead. Generative AI is among these tools, but so are low code platforms, so is React, so is AWS, heck, so is the power grid. But in recent times generative AI is a big leap forward.
We’re at the start of another cycle: a lot of niche products, followed by the rise of big Acme megacorps who conquer them all with economies of scale and compete on margin. It comes just as we’re at the tail end of this cycle with tech as we knew it for the last 50 or so years.
Yes, and then you get the job there and regret it bc they’ll either have an over-engineered Rube Goldberg contraption or they have system envy bc they’ve read about this architecture in blogs and THINK they need K8s and that it will fix all their problems.
I don't think they're completely not looking for that entire type of answer, but those examples are pretty dry and don't really go into the reasoning for your opinion, which is probably what they're worried about. Whenever you say something isn't worth considering or doesn't seem necessary, you should be explaining exactly why you think that, and exactly where it would be worth considering or seem necessary, because otherwise you just look like someone who simply doesn't care about whatever kind of scalability they're asking about.
You can always say “Since we’ve got only x QPS, I’m going to do A. If we had say y QPS, I’d do B but that would impact the rest of the design. Let me know if you anticipate growth to y and I can show you how I’d do it”
The point of an interview is to lay bare one’s thought process entirely so that the interviewer has full awareness of the person you are. And to likewise extract that from the interviewer. Getting or transmitting less information is just underutilizing the time. Interviewers are also flawed and may not be good enough at extracting the information from you.
If you’re an ideal decision maker, you will likely out-skill the majority of interviewers. You’re being hired to make their org succeed. So just do that.
I think people who describe system designs frequently fail to demarcate the space they’re operating in, so subsequent engineers cannot determine whether the original designer failed to consider something or whether the original designer considered and dismissed something. The point is to be able to express this concisely.
IMHO, doing it well means that not only do you get it right but you send the information down through time so that subsequent observers understand why and also get it right consequently.
Who does this? Why make something 10x as complicated as it needs to be, when you could just use the simple thing and get 10x as far? It's not like there's not enough work to do.
Who accumulates decades of legacy code?!
Real companies do. The moment you deploy one line of code, it's legacy. It goes from there. Soon you have to build systems that interface with other systems you'd rather were better architected and designed, except you have to deal with them as they are. Then your product becomes one of these, and with no need to maintain or expand it for a long time, it rots a bit, and now someone has to pick it up or interface with it, and your product made things more complex, and the complexity can't be magic wand waved away.
In some ways it's worse. There are also project review interviews. "We had a Rails/Django/whatever monolith that was backed by Postgres and we didn't need a SPA" makes for a less impressive session with many companies. This creates a lot of incentive to overcomplicate/"future proof" things for resume building.
This. There is also really no easy way of telling how an interviewer is thinking. One interviewer thought not having a warehouse in the design was a mistake, and another thought having a caching solution made things too complex. It is completely hit or miss with interviews
If you know that those are not the answers they are looking for, you can reasonably pass by modifying the answer only slightly, while still getting your point across.
If you can't, you might be getting interviewed by people you do not want to work with, and you should want to know that.
Except these are the people in your way of getting that job that could be potentially life/career changing for you financially or otherwise. In this market or depending on your situation that would be hard to ignore.
I think that's a red herring. You are a knowledge worker. You are paid to disagree when necessary. Yes, people will probably take offense when you say "that's just a dumb question" but if they can't at least be approached when you offer your opinion in a palatable way, that's simply not going to work.
Understand what is being asked. Your insight on a topic is being tested. Offer an answer that does not read like a dodge or a coin flip.
Nothing against your content…but Kubernetes does manage Kubernetes.
That becomes obvious when you start bootstrapping an HA cluster with multiple control plane nodes.
K8s is not for the faint of heart…or rational system designers ;)
Answer what they want and finish with "but in practice it doesn't matter for this much traffic and would be wasted effort".
People ask for fizzbuzz in parallel not because it's practical.
You don’t want to work at a company like this anyway.
True, but people generally don't want to get evicted or have their utilities turned off either. If you need a job you need a job, and the numbers out of Cali's job market put a lot of tech people in a position where they might not have the luxury of waiting for the "right fit". As always, YMMV and the world is a big place, everyone's different, yadda yadda.
If you want to get top dollar at a FAANG you will need to go through these type of system design interviews. You could say you shouldn’t work for a FAANG which is fair, but FAANG pays top dollar.
So if you're after the FAANG money, you have to play the FAANG interview game. If you're unwilling to play the FAANG interview game, then maybe you shouldn't be pursuing FAANG money.
It's not just FAANG that do these interviews anymore.
Tiny startups do them now as well.
Doing FAANG interviews is a necessary, but not sufficient condition to getting a FAANG salary. As others have said, non-FAANGs are trying to do FAANG interviews, and I would tell them to pound sand.
I think you might be missing the point.
Your answers are completely valid but you have to communicate to the interviewer that you considered the possibilities and the tradeoffs.
If the interviewer needs to "forcefully" extract from you the logic behind your design choices, then a lot of times that's enough to fail you.
The interviewer is just another engineer trying to understand if you are someone who they can have a design discussion with.
Dismissive answers that assume they are needlessly over complicating things tells them exactly what they need to know
Why would someone ask about low QPS? Seems it would evoke the “whatever” answers you gave.
> SQL and NoSQL don’t matter much
Database is literally the most important architectural decision possible, next to the application programming language.
(Prove me wrong)
What a great article. It's always a treat to read this sort of take.
I have some remarks though. Taken from the article:
> Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service.
This is not so cut-and-dry. The trade offs are far from obvious or acceptable.
If the five services access the database then you are designing a distributed system where the interface being consumed is the database, which you do not need to design or implement, and which already supports authorization and access controls out of the box, and you have out-of-the-box support for transactions and custom queries. On the other hand, if you design one service as a high-level interface over a database then you need to implement and manage your own custom interface with your own custom access controls and constraints, and you need to design and implement yourself how to handle transactions and compensation strategies.
And what exactly do you buy yourself? More failure modes and a higher micro services tax?
Additionally, having five services accessing the same database is a code smell. Odds are that database fused together two or three separate databases. This happens a lot, as most services grow by accretion and adding one more table to a database gets far less resistance than proposing creating an entire new persistence service. And is it possible that those five separate services are actually just one or two services?
> the interface being consumed is the database, which you do not need to design or implement
You absolutely should design and implement it, exactly because it is now your interface. In fact, it will add more constraints to your design, because now you have different consumers and potentially writers all competing for the same resource with potentially different access patterns. Plus the maintenance overhead that migrations of such shared tables come with. And eventually you might have data in this table that are only needed for some of the services, so you now need to implement views and access controls at the DB level.
Ideally, if you have a chance to implement it, an API is cleaner and more flexible. The problem in most cases is simply business pushing for faster features which often leads to quick hacks including just giving direct access to some DB table from another service, because the alternative would take more time, and we don't have time, we want features, now.
But I agree with your thoughts in the last paragraph. It happens very often that people don't want to undertake the effort of a whole new design or redesign to match the evolving requirements and just patch it by adding a new table to an existing DB, then another,...
> In fact, it will add more constraints to your design, because now you have different consumers and potentially writers all competing for the same resource with potentially different access patterns. Plus the maintenance overhead that migrations of such shared tables come with. And eventually you might have data in this table that are only needed for some of the services, so you now need to implement views and access controls at the DB level.
PostgreSQL, to name one example, can handle every one of these challenges.
It's not that it is not possible, but whether it's a good idea.
The usual problem is that some team exposes one of their internal tables and they don't have control over what types of queries are run against it, which could impact their service when the access patterns differ. Or the external team asks for extra fields that do not make sense for the owning team's model. Or for adding some externally sourced information. Or the team moves from PostgreSQL to S3 or DynamoDB. And this is not an exhaustive list. An API layer is more flexible and can remain stable over a longer time than exposing an internal implementation that depends on a particular technology, implemented in a particular way, at the time they agreed on sharing.
This is, of course, not a concern inside the same team or very closely working teams. They can handle the necessary coordination. So, there are always exceptions and simple use cases where DB access works just fine. Especially, if you don't already have an API, which could be a bigger investment to set up for something simple if it's not even known yet the idea will work etc.
> Plus the maintenance overhead that migrations of such shared tables come with.
Moving your data types from SQL into another language solves exactly 0 migration problems.
Every migration you can hide with that abstraction language you can also hide in SQL. Databases can express exactly the same behaviors as your application code.
I’m generally pro SQL-as-interface, but this is just wrong.
Not only are there all sorts of bizarre constraints imposed by databases on migration behavior that application code can’t express (for example, how can I implement a transaction-plus-double-write pattern to migrate to use a new table because the locks taken to add an index to the old table require unacceptably long downtime? There are probably some SQL engines out there that can do this with views, but most people solve it on the client side for good reason), but there are plenty of changes that you just plain can’t do without a uniform service layer in front of your database. Note that “uniform service layer” doesn’t necessarily mean “networked service layer”, this can be in-process if you can prevent people from bypassing your querying functions and going directly to the DB.
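For what it's worth, the client-side double-write the parent describes usually looks something like this (a rough sketch using sqlite3 for brevity; the table and column names are invented):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS events_old (user_id INTEGER, payload TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS events_new (user_id INTEGER, payload TEXT)")

    def record_event(user_id, payload):
        # During the migration window the service layer writes the legacy table and its
        # replacement in one transaction, so reads can be cut over later without taking
        # a long lock on the old table.
        with conn:  # sqlite3: implicit BEGIN/COMMIT (or ROLLBACK on error)
            conn.execute(
                "INSERT INTO events_old (user_id, payload) VALUES (?, ?)",
                (user_id, payload),
            )
            conn.execute(
                "INSERT INTO events_new (user_id, payload) VALUES (?, ?)",
                (user_id, payload),
            )

The catch is that this only works if every writer goes through record_event(), which is exactly the uniform service layer argument.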
> Note that “uniform service layer” doesn’t necessarily mean “networked service layer”, this can be in-process if you can prevent people from bypassing your querying functions and going directly to the DB.
You can take it one step further and implement the “uniform service layer” in the database itself - using stored procedures and views.
This has downsides, like strong coupling with the specific DBMS, and difficulty of development in a comparatively primitive SQL dialect, but protects the database from “naughty” clients and can have tremendous performance advantages in some cases.
As the sibling comment mentioned, this is a solved problem. MySQL and MariaDB take a very brief lock on the table at the very end of index creation that you will not notice, I promise. Postgres does the same if you use the CONCURRENTLY option for index builds.
If for some reason you do need to migrate data to a new table, triggers.
If somehow these still don’t solve your problem, ProxySQL or the equivalent (you are running some kind of connection pooler, right?) can rewrite queries on the fly to do whatever you want.
Either triggers or create index concurrently? Do most people solve that on the client side? Doesn't e.g. percona use triggers?
> And what exactly do you buy yourself?
APIs can be evolved much more easily than shared database schemas. Having worked with many instances of each kind of system, I think this outweighs all of the other considerations, and I don't think I'll ever again design a system with multiple services accessing the same database schema.
It was maybe a good idea if you were a small company in the early 2000s, when databases were well-understood and services weren't. After that era, I haven't seen a single example of a system where it wasn't a mistake for multiple services to access the same database schema (not counting systems where the read and write path were architecturally distinct components of the same service.)
I implemented an interesting service 15 years ago, and recently heard about it again.
This service was basically a "universal integration service". The company wanted to share some data and wanted to do it in a universal way. So I implemented a SOAP web service that received a request containing SQL text and responded with a list of rows. This service was surprisingly popular and used a lot.
I was smart enough to build a limited SQL syntax parser and a UI, so an administrator could set up exactly which tables and columns they wanted to share with a specific client. The SQL was limited in the sense that it worked only against one table, a simple set of columns, and whatever limited conditions I had bothered to implement.
The reason I heard about it a few months ago is that they told me they had caught a malicious guy who worked at one of the companies integrating with the system and tried an SQL injection attack. They noticed the errors in the logs and caught him.
Their database schema is pretty much done and frozen; they hardly evolve it. So this service turned out to be pretty backwards-compatible. And simple changes could of course be supported with a view, if necessary.
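The gatekeeping part of something like that doesn't have to be big. A toy sketch of the allowlist idea (invented table names, and obviously not a replacement for a real parser, nor for parameterized values, which it still relies on):

    # Admin-configured allowlist: which tables and columns a client may read.
    ALLOWED = {
        "customers": {"id", "name", "country"},
        "orders": {"id", "customer_id", "total"},
    }

    def build_query(table, columns, filters):
        """Build a single-table SELECT from allowlisted identifiers only.
        Filter values are returned separately, to be bound as parameters."""
        if table not in ALLOWED:
            raise ValueError(f"table not exposed: {table}")
        for col in list(columns) + list(filters):
            if col not in ALLOWED[table]:
                raise ValueError(f"column not exposed: {col}")
        sql = f"SELECT {', '.join(columns)} FROM {table}"
        if filters:
            sql += " WHERE " + " AND ".join(f"{col} = %s" for col in filters)
        return sql, list(filters.values())

    # SELECT id, name FROM customers WHERE country = %s   with params ['NL']
    sql, params = build_query("customers", ["id", "name"], {"country": "NL"})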
Service-specific views, my guy.
And when the underlying tables have to change, what then?
Views are good, and help with this situation. But if the data is complicated, big, and even somewhat frequently changes shape (DDL), views only help a little.
That said, I think API update coordination is often much harder than schema change coordination (API behavior has many more dimensions along which it can change than database query behavior), so I am generally open to multiple services sharing a database, so long as it's within reason, understood to pose risks, appropriately justified, and used responsibly.
100%. Views don't even cover all the use cases of schema evolution, unless you're willing to duplicate business logic between stored procedures and services, but schema evolution is only the start of it. API versioning gives you a lot more flexibility to evolve how data is stored and accessed. Some parts of your data might get shifted into other data stores, some functionality might get outsourced to third party APIs, you might have to start supporting third-party integrations, etc. Try doing that from a view -- or rather, please don't!
The goal is to minimize what needs changing when things need changing.
When you need to alter the datastore, usually for product or scalability, you have to orchestrate all access to that datastore.
Ergo: only one thing using the datastore means less orchestration.
At work, we just updated a datastore: we had to move some tables to their own DB. Three years later, 40+ teams have updated their access. This was a product need. If it had been a scale issue, the product would simply have died, absent some as-yet-unimagined solution.
A shared code library for DB access is an alternative there.
That just moves your API layer into a client library you now need to build and distribute for every programming language your customers use. There are cases where a thick client makes sense, but it's usually easier to do it server-side and let customers consume the API from their own environment: patching the server is easier than shipping library updates to every user.
I think most of the discussion in this thread assumes that “customers” of the interface are other groups in the same organization using the database for a shared overarching business/goal, not external end user customers.
For external end users, absolutely provide an API, no argument here. The internal service interactions behind that API are a less simple answer, though.
It's definitely worse for external customers, of course, but it's still not that easy even for internal ones. The main problem is that the exposed tables usually weren't meant to be public interfaces, so the owning team ends up with an external dependency on its internal schema. The other team may have completely different goals, priorities, speed, size, management, and end users. At some point they might ask the owning team to add some innocent-looking fields to an internal table for them; the owning team might need changes to support its own service that aren't compatible with the other team; the other team runs queries the owning team has no control over, which can hurt performance. If possible, it's better to agree on an API and avoid depending on internal implementations directly, even for internal customers. There are always some exceptions: very close teams or subteams under the same management with the same customers can be fine, or the table in question might have been explicitly designed as a public interface (rare, but possible).
> Additionally, having five services accessing the same database is a code smell.
Counterpoint (assuming by database you mean database cluster, not a schema): having a separate physical DB for each service means that, for most places, your reliability has now gone from N to N^M, because a request that has to touch M services now depends on M databases being up instead of one.
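Back-of-the-envelope, and assuming independent failures:

    # If a request fans out over M services, each behind its own database
    # that is up with probability p, the whole path works with probability p**M.
    p, M = 0.999, 5
    print(f"one shared DB:     {p:.4%}")       # 99.9000%
    print(f"five separate DBs: {p ** M:.4%}")  # 99.5010%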
From which perspective? If a service is up but unable to do anything because another service is down, what good does that do other than improve some metrics on some dashboard? (Note that we are specifically talking about coupled services, since the implication is a single DB being split into multiple DBs: a distributed monolith.)
Fair point. Unfortunately for me, the only kind of microservice architecture I’ve ever worked with is a distributed monolith - at multiple companies.
I think the author meant, in a general way, it’s better to avoid simultaneous writes from different services, because this is an easy way to introduce race conditions.
>And what exactly do you buy yourself? More failure modes and a higher micro services tax?
Nice boxes in the architectural diagram. Each box is handed to a different team and then, when engineers from those teams don't talk to each other, the system doesn't suddenly fail in an unexpected way.
At amzn a decision from atop was made that nobody would ever write in shared dynamo db tables. A team would own and provide APIs. That massively improved reliability and velocity.
The team boundary is very important. You can get away with a shared DB for a long time if the same team owns every service that accesses it and keeps absolutely tight control over them. If different teams are in the picture, however, the tight coupling becomes a source of problems and a bottleneck, beyond prototyping / idea validation, etc.
I don't need a decision from atop Amazon to remind me how painful it would be to migrate a widely shared DynamoDB instance or, god forbid, change DAX settings.
> When querying the database, query the database. It’s almost always more efficient to get the database to do the work than to do it yourself. For instance, if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory.
Oh yes! Never do a join in the application code! But also: use views! (And stored procedures, if you can.) A view is an abstraction over the underlying data; it's functional by nature, unlikely to break for random reasons in the future, and if done well the underlying SQL is surprisingly readable and easy to reason about.
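A self-contained toy comparison (sqlite3 only so it runs anywhere; the point is the same on any SQL database):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users  VALUES (1, 'ada'), (2, 'grace');
        INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 2, 3.0);
    """)

    # Let the database do the work: one query, one round trip.
    rows = db.execute("""
        SELECT u.name, SUM(o.total)
        FROM users u JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()

    # The in-memory alternative: two queries plus hand-rolled stitching,
    # more code, and more data shipped out of the database.
    users = dict(db.execute("SELECT id, name FROM users"))
    totals = {}
    for user_id, total in db.execute("SELECT user_id, total FROM orders"):
        totals[user_id] = totals.get(user_id, 0) + total
    stitched = [(users[uid], t) for uid, t in totals.items()]

    assert sorted(rows) == sorted(stitched)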
This is a big part of what makes ORMs a problem.
Writing raw SQL views/queries per MVC view in SSR arrangements is one of the most elegant and performant ways to build complex web products. Let the RDBMS do the heavy lifting with the data. There are optimizations in play you can't even recall (because there are so many) if you're using something old and enterprisey like MSSQL or Oracle. The web server should be able to directly interpolate SQL result sets into corresponding <table>s, etc. without having to round-trip for each row or perform additional in-memory join operations.
The typical ORM implementation is the exact opposite of this - one strict object model that must be used everywhere. It's about as inflexible as you can get.
Most ORMs will happily let you map stored procedures and views to a class; you can have as many models as you want. So your point doesn't really make sense.
The author said nothing about ORMs. It feels like you're airing a personal beef about ORMs that runs entirely against the pragmatic software design the author is advocating. Using an ORM to massively reduce your boilerplate CRUD code, then using raw SQL (or raw SQL with the ORM doing the column mapping) for everything else, is a pragmatic design choice.
You might not like them, but using ORMs for CRUD saves a ton of boilerplate, error-prone code. Yes, you can footgun yourself. But that's what being a senior developer is all about: using the tools you have pragmatically and not footgunning yourself.
And it's just a matter of looking for the patterns: if you see a massive ORM query, you're probably looking at a code smell, i.e. a query that should be in raw SQL.
In Go, for example, there is a mixed approach of pgx + sqlc, which is basically a combo of the best Postgres driver + type-safe code generator (based on raw SQL).
Even though I often use pgx only, for a new project, I would use the approach above.
I did some exploratory analysis of sqlc some time ago and couldn't for the life of me figure out how to parameterize which column to sort by or group by in queries.
It's quite neat, but I don't think it actually replaces a proper ORM in the cases where ORMs are actually useful. That's on top of all the usual codegen pitfalls.
I personally quite like the Prisma approach, which doesn't map database data to objects but just returns an array of tuples based on your query (and never lazy-loads anything), with TypeScript types computed dynamically from the queries. It has its own pitfalls as well (like no types when using raw queries).
The way you describe it, it would be ideal if ORMs only handled very basic CRUD and forced you to use raw SQL for complex queries. But that's not reality, and not how they are always used. In my opinion, some devs take pride in doing everything with their favorite ORM.
I think if an app is 90% ORM code with the remainder as raw queries, a junior is inclined to favor the ORM code and gets little exposure to actually writing SQL. They're unlikely to become an SQL expert that way; writing SQL, even behind a code facade, they would.
And the ORM-free code has massive downsides too, not least that if you add or change a column, code can break at runtime instead of at compile time.
The negatives of not using an ORM are far worse than the negatives of not reining in some developers who shouldn't be writing complex queries.
If they don't even know how to check the SQL their complex ORM query produces, that's a training problem, not an ORM problem.
It's one of our great weaknesses as a profession, assuming everyone will figure stuff out on their own.
With an ORM your application code is your views.
You can write reusable plain functions as abstractions, returning QuerySets that allow further filters to be chained onto the query before the actual SQL is materialized and sent to the database.
The result of this doesn't have to match the original object models you defined; it's still possible to be flexible, e.g. with group-bys that produce dictionaries.
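The same idea, sketched with SQLAlchemy rather than Django purely so it's self-contained (table and helper names invented, assumes SQLAlchemy 1.4+): each helper returns a statement object, and nothing hits the database until it's executed.

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

    engine = create_engine("sqlite://")
    metadata = MetaData()
    orders = Table(
        "orders", metadata,
        Column("id", Integer, primary_key=True),
        Column("status", String),
        Column("total", Integer),
    )
    metadata.create_all(engine)

    # Reusable building blocks: each returns a query object; no SQL has
    # been sent to the database yet.
    def open_orders():
        return select(orders).where(orders.c.status == "open")

    def above(query, amount):
        return query.where(orders.c.total > amount)

    # Callers keep chaining filters; SQL is only emitted on execution.
    with engine.connect() as conn:
        big_open_orders = conn.execute(above(open_orders(), 100)).all()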
But converting a SQL relation to a set of dictionaries already carries a lot of overhead: every cell in the result set must be converted to a key-value pair. And vertically "slicing" a set of dictionaries is much more expensive than doing the same to a 2D array. So while you might want to offer a dictionary-like interface for the result set, please don't use a dictionary-like data structure underneath.
There are valid reasons to avoid complex ORM/query result representations, but this isn’t one of them.
I have very rarely seen, or even heard of, the result representation for an SQL query being a bottleneck. The additional time and space needed to represent a raw tabular result in a more useful way on the client are nearly always rounding errors compared to the time needed to RPC the query itself and the space taken up by the raw bytes returned. Given that, and the engineering time wasted working with (and fixing inevitable bugs in) fully tabular result data structures (arrays, bytes), this is bad advice.
Unpopular opinion: an ORM is by definition the lowest common denominator of its supported databases' features. It exists only because people don't like the aesthetics of SQL, but the cost of using one is immense.
Not unpopular. ORM hate is real. I like SQLAlchemy and Drizzle in projects for the features they give you for free (such as Alembic migrations and an instant GraphQL server), but I still write SQL for most stuff.
If your ORM is going to the DB per row you're using it wrong. N+1 queries are a performance killer. They are easy to spot in any modern APM.
Rails makes this easy to avoid. Using `find_each` batches the queries (by 1,000 records at a time by default).
Reading through the comment section on this has been interesting. Either a lot of people are using half-baked ORMs, or a lot of people have little experience with an ORM, or both.
I mean Rails also makes it easy to accidentally nest further queries inside your `find_each` block and end up with the same problem.
Your team can have rules and patterns in place to mitigate it but I'd never say "Rails makes this easy to avoid".
This is true of any interaction with the DB, ORM or otherwise. Regardless of the layer of abstraction you choose to operate at, you still need to understand the underlying complexity.
What Rails gives you is easy to use (and understand) abstractions that enable you to directly address performance issues.
Easy is highly contextual here, because none of this is trivial.
I think the real value in frameworks like Rails and Django is that they make it easier to collaborate. When you do it from scratch, people inevitably write their own abstractions, and then you can't share code as easily.
Even in the article, the solution wasn't to abandon the ORM in favor of raw SQL, but to know how to write the code so it doesn't run 100 extra queries when it doesn't need to.
> Particularly if you’re using an ORM, beware accidentally making queries in an inner loop. That’s an easy way to turn a select id, name from table to a select id from table and a hundred select name from table where id = ?.
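Spelled out with plain queries (sqlite3 here just to make it runnable), the accidental version is one query for the ids plus one query per row, where a single statement was meant:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
    """)

    # Accidental N+1: one query for the ids, then one query per row.
    ids = [row[0] for row in db.execute("SELECT id FROM t")]
    n_plus_one = [
        (i, db.execute("SELECT name FROM t WHERE id = ?", (i,)).fetchone()[0])
        for i in ids
    ]

    # What was meant: a single query.
    single = db.execute("SELECT id, name FROM t").fetchall()

    assert n_plus_one == single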
>The typical ORM implementation is the exact opposite of this - one strict object model that must be used everywhere. It's about as inflexible as you can get.
I can't respond to the "typical" part as most of my experience is using EF Core, but it's far from inflexible.
Most of my read-heavy search queries are views I've hand-written that integrate with EF Core. This lets me get the benefit of raw SQL but still use LINQ for sorting/paging/filtering.
Have you ever built a complex app like this?
In particular, have you had to handle testing, security (e.g. row-level security), migrations, change management (e.g. for SOC 2 or other security frameworks), cache offloads (Redis and friends), support for microservices, etc.?
Comments like this give me a vibe of young developers trying out Supabase for the first time feeling like that approach can scale indefinitely.
> Comments like this give me a vibe of young developers
I don't think so. The context is about avoiding in-memory joins, which are fairly awful to do in an application and should be avoided, along with uninformed use of ORMs, which often just adds a layer of unwarranted complexity leading to things like the dreaded N+1 problem most inexperienced Rails developers hit with ActiveRecord.
If anything, what you’re talking about sounds like development hell. I can understand a database developer having to bake in support for that level of security, but developing an app that actually uses it gets you so far in the weeds that you can barely make progress trying to do normal development.
A developer with several years of experience or equivalent will have pride in developing complexity and using cool features that make them feel important.
After a developer has maybe twice that many years experience or equivalent, they may develop frameworks with the intent to make code easier to develop and manage.
And beyond that level of experience, developers just want code that’s easy to maintain and doesn’t make stupid decisions like excessive complexity. But, they know they have to let the younger devs make mistakes, because they don’t listen, so there is no choice but to watch hell burn.
Then you retire or get a different job.
I don't know which part of what I'm talking about sounds like hell.
I am merely talking about properties of developing complex web applications that have traditionally not been easy to work with in SQL.
I am in particular not proposing any frameworks.
How can that sound like hell?
Not the person you replied to, but I have! A Java project I worked on a couple of years ago used a thin persistence layer called JOOQ (a Java library). It basically helps you safely write SQL in Java, without ORM abstractions. Worked just fine for our complex enterprise app.
SQL migrations? This is a solved problem: https://github.com/flyway/flyway
What about microservices? You write some Terraform to provision a SQL database (e.g. AWS Aurora) just like you would with DynamoDB or similar. What does that have to do with ORMs?
What about Redis? Suddenly we need an ORM to query Redis, to check if a key exists in the cache before hitting our DB? That's difficult code to write?
I’m confused reading your comment. It has “you don’t do things my way so you must be dumb and playing with toy projects” vibes.
As a previous user of Alembic, I was surprised that Flyway's migrations only go forward by default and that reversing them is a premium feature. That's like the luxury trim being the only one with seatbelts.
It's been a while since I used Flyway. Is there a better option in 2025? Just curious.
From what I can see, JOOQ is only really type-safe with POJO mappings, at which point it is an ORM with an expressive query DSL.
Alternatively, you use record-style outputs, but those are prone to errors if positions change.
Regardless, even with JOOQ you still accept that there is a sizable application layer taking responsibility for the requirements I listed.
I guess it's semantics, but I actually agree with you. After all, ORM = object-relational mapping. However, it's certainly the most lightweight ORM I've used in the Java and C# world. With JOOQ you are in complete control of what the SQL statements look like and when those queries happen (which avoids the common N+1 risk). _Most_ ORMs I've seen try to abstract the query away from the library user.
In our project we generated POJOs in a CI pipeline, corresponding to each new Flyway migration script. The POJOs were pushed to a dedicated Maven library. This ensured our object mappings were always up to date. And then we wrote SQL almost the old-fashioned way… but with a type-safe Java DSL.
I don't understand why all these problems should be easier to handle with an ORM than with raw SQL.
Why is it so hard to believe that well-tested, typed code is better than manual string concatenation?
Before you tell me how you just use a query builder/DSL and an object mapper for convenience: that's a freaking ORM!
It is a granularity tradeoff.
With SQL you need to explicitly test all queries where the shape granularity is down to field level.
When you map data onto an object model (in the DTO sense, not the OOP sense) you have bigger building blocks.
This gives a simpler application that is more reliable.
Obviously you need to pick a performant ORM, and it seems a lot of people in these threads have been traumatized.
Personally, I run a complex application where developers freely use a GraphQL schema and requests are below 50 ms p99. GraphQL is translated into joins by the ORM, so we do not have any N+1 issues, etc.
The issue with GraphQL tends to be unoptimized joins instead. Is your GraphQL API available for public consumers? How do you manage them issuing inefficient queries?
I've most often seen this countered through data loaders (batched queries that are merged in code) instead of joins, or query whitelists.
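A data loader is basically "collect the keys, do one IN query, fan the rows back out". A bare-bones sketch (sqlite3 and invented tables; real implementations also batch across the whole request):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO authors VALUES (1, 'ada'), (2, 'grace');
    """)

    def load_authors(author_ids):
        """Batch-load authors with a single IN query, then fan the rows
        back out in the order the keys were requested."""
        unique = list(dict.fromkeys(author_ids))
        placeholders = ", ".join("?" for _ in unique)
        rows = db.execute(
            f"SELECT id, name FROM authors WHERE id IN ({placeholders})", unique
        ).fetchall()
        by_id = {row[0]: row for row in rows}
        return [by_id.get(i) for i in author_ids]

    # One query serves every post in the GraphQL response, instead of
    # one author lookup per post.
    print(load_authors([1, 2, 1, 1]))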
While this API in particular is not publicly exposed, that would not be a concern.
The key is to keep the same schema in the database as in the GraphQL API and to use tooling that can translate a GraphQL query into a single SQL query.
The issue I've seen with GraphQL isn't necessarily the count of queries run, but rather the performance of said queries (i.e. most SQL queries are not performant without proper indexes for the specific use case, and GraphQL allows lots of flexibility in what queries users can run).
Yes - one needs to ensure that the data is well indexed - that is reasonable.
But an index does not need to narrow things down to a single row. It is OK if indices reduce the result set to tens or a couple of hundred rows; that is well within the performance requirements (... of our app).
To my ears that's just neglect: you assume your ORM does the basic data mapping right and don't verify it?
> You assume your ORM does the basic data mapping right
You know, it should. There's no good reason for an ORM to ever fail at runtime due to mapping problems instead of compile time or start time. (Except, of course if you change it during the software's execution.)
No? The difference is verifying it once for the ORM vs. once for every single place you query.
I have to respond here, as the reply depth limit has seemingly been reached.
Since you've mentioned GraphQL, you're probably comparing the ORM in that sense to a traditional custom API backed by raw SQL. In a fair comparison both versions would do exactly the same thing and require the same essential tests. Assuming more query variations for the raw SQL version is just assuming it does more, or somehow does it badly architecturally, which is not a fair comparison.
The ORM represents deferred organization, i.e. someone else is testing the mapping and query generation for you.
An example is Prisma: Prisma has a team of engineers who work on optimizing query generation and providing a simple, intuitive API.
Not using an ORM forces you to take over that work and to test the extra complexity that goes into your code base.
It might be merited if you get substantial performance boosts, but I have not seen a reasonably modern ORM where performance is the issue.
A raw query doesn't have to be repeated in every place it's required. Not sure what your point is.
You will have a bigger variety of queries when you don't use an ORM; this puts a higher load on testing to get the same level of reliability.
> 50 ms p99
You realize that’s abysmally bad performance for any reasonable OLTP query, right? Sub-msec (as measured by the DB, not including RTT etc.) is very achievable, even at scale. 2-3 msec for complex queries.
That is the response time for the server, not the database, which it appears everyone but you understood clearly from the context.
C#'s LINQ-based ORMs have always worked the same way: a type-safe feature built into the language -> runtime generation of a provider-agnostic expression tree -> the database provider converts it into SQL. It does database joins (unless you do something stupid like drop out of IQueryable land).
Stored procedures seem like a win, but the big problem is that while I could write the rest of the software in a very nice modern language like Rust, or more practically in C# since my team all know C#, if I write a stored procedure it will be in Transact-SQL, because that's the only choice.
T-SQL was not a good programming language last century when it was vaguely current, so no, I do not want to write any significant amount of code in T-SQL. For my sins I maintain a piece of software with huge T-SQL procedures (multi-page elaborations by somebody who really, really liked this stuff) and they're a nightmare. The tooling doesn't really believe in version control, and the diagnostics when you make a mistake are either non-existent or C++-style useless spew.
We hire a lot of very junior developers. People who still need to be told not to comment out code in a release, that variable names are for humans to read, not machines, that sort of thing. We're not quite hiring physicists to write software (I have done that at a startup) but it's close. However, none of the poor "my first program" code I see in a merge request from a new hire is anywhere close to as unreadable as the T-SQL we already own and maintain.
I've only tried to use stored procedures once, in MySQL, and they were almost impossible to debug back then. Very painful. Average devs already have trouble being smart with their databases, and stored procedures would add to that.
Stored procedures also add another risk. You have to keep them in sync with code, making releases more error prone. So you have to add extra layers of complexity to manage versioning.
I can see the advantage of extreme performance/efficiency gains, but it should be really big to be justified.
I'm a big postgres guy and in theory I love stored procedures (so many language options!) but you're 100% right that the downsides in terms of DX make them pretty much the last thing I reach for unless they're a big performance/simplicity win and I expect them to be pretty static over time.
> Stored procedures also add another risk. You have to keep them in sync with code, making releases more error prone.
This one is easily solved: never change a stored procedure. Every version should get a new name.
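i.e. get_customer_v2 goes in alongside get_customer_v1, and callers migrate at their own pace. Something like the following (Postgres/psycopg2, names invented):

    import psycopg2

    conn = psycopg2.connect("dbname=shop user=app")
    with conn.cursor() as cur:
        # v1 is left untouched, so existing callers keep working; the new
        # behaviour ships under a new name and can never break old clients.
        cur.execute("""
            CREATE FUNCTION get_customer_v2(p_id bigint)
            RETURNS TABLE (customer_id bigint, name text, country text)
            LANGUAGE sql AS $$
                SELECT c.id, c.name, c.country FROM customers c WHERE c.id = p_id
            $$
        """)
    conn.commit()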
I worked at a place with just such a system. Half the application code was baked into sprocs, with no version control and hidden knock-on effects everywhere.
There was _one guy_ who maintained it and understood how it worked. He was very smart but central to the company's operations. So having messy stuff makes it brittle and hard to change in more ways than one.
I disagree. In modern highly scalable architectures I'd prefer doing joins in the layer in front of the database (the backend).
The "backend" scales much more easily than the database. Loading data by simple indexes, e.g. user_id, and joining it in the backend keeps the DB fast. Spinning up another backend instance is easy, unlike another DB instance.
If you think your joins must happen in the DB because the data is too big to load into memory on the backend, restructure it so that it's possible.
Bonus points for moving joins to the frontend. That makes the data highly cacheable and fast to load, since you need to load less data, and it frees up resources on the server side.
"High scale" is so subjective here. I'd hazard a guess that 99% of businesses are not at the scale where they need to worry about growing beyond what a single Postgres or MySQL instance can handle.
On one project I worked on, the issue was the ORM generating queries that Postgres deemed too large to execute in memory, so it fell back to performing them on disk.
Interestingly it didn't even use JOIN everywhere it could because, according to the documentation, not all databases had the necessary features.
A hard lesson in the caveats of outsourcing work to ORMs.
I've worked both with ORMs and without. As a general rule, if the ORM is telling you there is something wrong with your query / tables it is probably right.
The only time I've seen this in my career was a project that was an absolute pile of waste. The "CTO" was self-taught, and all the tables were far too wide, with a ton of null values. The company did very well financially, but the tech was so damn terrible. It was such a liability.
Scalability is not the keyword here.
The same principle applies to small applications too.
If you apply it correctly, the application is never going to be slow due to slow DB queries, and you won't have to optimize complex queries at all.
Plus if you want to split out part of an app to its own service, it’ll be easily possible.
One of the last companies I worked at had very fast queries and response times while doing all the joins in memory, in the database, and that was on a small machine with 8 GB of RAM. That leaves a vast amount of room for vertical scaling before hitting limits.
Vertical scaling is criminally underrated, unfortunately. Maybe it's because horizontal scaling looks so much better on LinkedIn.
Sooner or later even small apps reach hardware limits.
My proposed design doesn’t bring many hard disadvantages.
But it allows you to avoid vertical hardware scaling.
Saves money and development time.
Not really disagreeing with you here, but that "later" never comes for most companies.
My manufacturing data is hundreds of GB to a few TB per instance, and I am talking about hot data that is actively queried. It is not possible to restructure, and it is a terrible idea to do those joins in the frontend. Not every app is tiny.
In some cases, it’s true.
But your thinking is rather limited. Even data like that can be organized in such a way that the joins don't have to happen in the DB.
This kind of design always "starts" on the frontend, by choosing how and what data will be visible, e.g. in a table view.
Many people think showing all the data, all the time, is the only way.
The SQL database has more than a dozen semi-independent applications that handle different aspects of the manufacturing process, from recipes and batches to maintenance, scrap management, and raw-material inventory. The data is interlocked, but the apps are independent, since different people in very different roles use them. No, it never started in the frontend; it started as a system and evolved by adding more data and more apps. Think of SAP as another such example.
This is an "old-school" design. Nowadays I wouldn't let apps meet in the database.
A simple service-oriented architecture is much preferred: each app with its own data.
Then such problems can easily be avoided.
It's not old school; it's actually solid design. I have also worked with people who think the frontend, or even the services, should guide the design/architecture of the whole thing. It seems tempting and initially gives the impression that it works, but long term it's just bad design. Keeping data structures (and mainly this means database structures) stable is key to long-term maintenance.
> It seems tempting and initially gives the impression that it works, but long term it's just bad design.
This reads as an opinion rather than an argument. Could you explain what you find bad about the design?
In any case, I believe a DB per backend service isn't a decision driven by the frontend - rather, it's driven by data migration and data access requirements.
It's an opinion based on countless references and books out there. I can't cite them all, but it's in the vein of "code should be designed to depend on abstract interfaces instead of a concrete implementation", "everything is a byte stream", "adding more people to a late project makes it later", "Bad programmers worry about the code. Good programmers worry about data structures and their relationships", "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.", etc. They are usually true.
> In any case, I believe a DB per backend service isn't a decision driven by the frontend - rather, it's driven by data migration and data access requirements.
I think the idea of breaking up a shared enterprise DB into many distinct but communicating and dependent DBs was driven by a desire to reduce team and system dependencies and so increase the ability to change.
While the pro is valid and we make use of the idea sometimes when we design things, the cons are significant. Splitting up a DB that has data that is naturally shared by many departments in the business and by many modules/functional areas of the system increases complexity substantially.
In the shared model, when some critical attribute of an item (SKU) is updated, all of the different modules and functional areas of the enterprise immediately use the current, correct master value.
In the distributed model, there is significant complexity and effort to share this state across all areas. I've worked on systems designed this way and this issue frequently causes problems related to timing.
As with everything, no single solution is best for all situations. We only split this kind of shared state when the pros outweigh the cons, which is sometimes but not that often.